Article

Short-Term Wind Speed Prediction via Sample Entropy: A Hybridisation Approach against Gradient Disappearance and Explosion

by
Khathutshelo Steven Sivhugwana
* and
Edmore Ranganai
Department of Statistics, University of South Africa, Florida Campus, Johannesburg 1709, South Africa
*
Author to whom correspondence should be addressed.
Computation 2024, 12(8), 163; https://doi.org/10.3390/computation12080163
Submission received: 25 May 2024 / Revised: 7 August 2024 / Accepted: 7 August 2024 / Published: 12 August 2024
(This article belongs to the Special Issue Signal Processing and Machine Learning in Data Science)

Abstract
High-variant wind speeds cause aberrations in wind power systems and compromise the effective operation of wind farms. A single model cannot capture the inherent wind speed randomness and complexity. In the proposed hybrid strategy, wavelet transform (WT) is used for data decomposition, sample entropy (SampEn) for subseries complexity evaluation, neural network autoregression (NNAR) for deterministic subseries prediction, long short-term memory network (LSTM) for complex subseries prediction, and gradient boosting machine (GBM) for prediction reconciliation. The proposed WT-NNAR-LSTM-GBM approach predicts minutely averaged wind speed data collected at Southern African Universities Radiometric Network (SAURAN) stations: Council for Scientific and Industrial Research (CSIR), Richtersveld (RVD), Venda, and the Namibian University of Science and Technology (NUST). For comparison purposes, in WT-NNAR-LSTM-GBM, LSTM and NNAR are respectively replaced with a k-nearest neighbour (KNN) to form the corresponding hybrids: WT-NNAR-KNN-GBM and WT-KNN-LSTM-GBM. We assessed WT-NNAR-LSTM-GBM’s efficacy against NNAR, LSTM, WT-NNAR-KNN-GBM, and WT-KNN-LSTM-GBM as well as the naïve model. The comparative study found that the WT-NNAR-LSTM-GBM model was the most accurate, the sharpest, and the most robust based on the mean absolute error, median absolute deviation, and residual analysis. The study results suggest using short-term forecasts to optimise wind power production, enhance real-time grid operations, and open the door to further algorithmic enhancements.

1. Introduction

1.1. Overview

While wind is a clean energy resource abundant in Southern Africa, harnessing its power is a complex and specialised task. Even so, the high-variant behaviour of wind speed causes aberrations in the wind power system, which compromises the effective operation of wind farms and the integration of large volumes of wind power into the power grid [1,2,3,4,5]. On the other hand, reliable and accurate wind power forecasts are crucial to increasing the penetration of wind power into electric grids. For instance, energy utilities and system operators require accurate short-term wind power forecasting information for real-time grid operations and regulation actions [1]. Thus, effective integration of wind energy into existing power grids is heavily reliant on the precision of wind power predictions. Since wind power is highly dependent on wind speed, accurate wind power forecasts can in turn be achieved by predicting wind speed accurately [2].

1.2. Literature Review

Forecasting strategies for wind power are divided into four categories: physical, statistical, machine learning, and hybrid methods [3,6,7,8,9,10,11]. Physical methods (such as numerical weather prediction (NWP)) are computationally expensive and inaccurate at short time scales [6,12,13], whereas the latter three are appropriate for short-term forecasting; physical methods are therefore reserved for medium-term and long-term forecasting. The most widely used statistical method for short-term wind speed forecasting is the class of well-known Box–Jenkins (1976) autoregressive moving average (ARMA) models, which are based on historical data. Their appeal derives from their simplicity, their ability to capture linear characteristics inherent in the data, and their high prediction accuracy at short-term horizons (see, e.g., [3] for details). However, ARMA models are unable to effectively capture the nonlinearity typically inherent in wind speed data. With the advancement of technology in recent years, machine learning methods such as artificial neural networks (ANNs) have gained popularity in time series forecasting [6]. In addition to identifying trends/patterns, these methods can automate various decision-making processes and can effectively capture nonlinear characteristics in wind speed series that statistical methods cannot adequately capture [6]. Hybrid models combine various models, such as statistical and machine learning methods, to form new and improved models (see, e.g., [12,13]). By retaining the advantages of each technique, hybrid modelling improves forecast accuracy.
In the literature, various ANNs have been discussed, including feed-forward neural networks (FFNNs) [11,12,14], convolutional neural networks (CNNs) [9], and recurrent neural networks (RNNs) [6,9,12,13]. Owing to their ability to capture nonlinearity and thereby achieve high forecasting accuracy, ANNs have been preferred over Box–Jenkins (1976) ARMA models over the past decades [14]. The FFNN is an ANN derived from statistical methods [12,14]. There are no cycles in these simple architectures, and data are only channelled in one forward direction [12,15]. Wind speed forecasts depend on the current inputs as well as the continuity and dependence of the input datasets [5]. FFNNs lack historical dependencies since their input neurons are independent [12,13]. Hence, FFNNs have a slow convergence rate and are easily overfitted to unstable and noisy data [16].
RNNs are another variant of ANNs, with an internal unit structure that carries the memory of the historical information of the time series [7,9,13,16]. In contrast to FFNNs, this deep learning technique uses recurrent neural connections to handle highly variant time series. Although RNNs can (to some extent) learn time relationships or long-term dependencies of time series [12,16], they are susceptible to gradient disappearance and explosion during deep propagation due to the hyperbolic tangent activation function [13,16,17]. Consequently, RNNs cannot fully learn the temporal behaviour of time series, which compromises the learning efficiency of the network [5]. To overcome this deficit, ref. [5] proposed the long short-term memory (LSTM) network, an improved and more robust type of RNN. In contrast to FFNNs and traditional RNNs, the LSTM learns temporal behaviour and long-term dependencies while addressing the defects of gradient disappearance and explosion [5,12,13]. LSTMs are structured so that they can forget irrelevant information; by turning repeated multiplication into addition in the cell-state update, they preserve long-range relationships before gradients decay to zero, thereby eliminating the gradient disappearance problem [9,12,13,16]. The LSTM model can also detect and capture both long-term and short-term dependencies in time series data [6]. Hence, the LSTM network has been widely applied in areas ranging from speech recognition and machine translation to handwriting recognition [13].
The accuracy of LSTMs can be significantly improved with pre-processing through strategies such as variational mode decomposition (VMD), empirical mode decomposition (EMD), and wavelet transformation (WT) [7,12,13,14,18,19]. In addition to having excellent localisation properties, the WT strategy can extract patterns, discontinuities, and trends in non-stationary time series [8,20], thus capturing these localised features. In [14], the authors proposed a combination of the wavelet transform technique (WTT) and a two-hidden-layer neural network (TNN), termed WTT-TNN, for wind speed forecasting. The WTT-TNN model showed higher accuracy in wind speed forecasting than the persistence (naïve) model (PM), a one-hidden-layer neural network (ONN), and the TNN [14]. In wind speed forecasting, a hybrid of WT and linear neural networks with tapped delay (LNNTD), termed WT-LNNTD, was proposed by [21]. The authors discovered that the WT-LNNTD model improved the performance of the individual forecasting models. Similar results were obtained in the work of [7,19], where WT and FFNNs were combined in short-term wind speed forecasting. In [13], the authors combined WT and LSTM in short-term wind speed forecasting. It was found that pre-processing through WT and then predicting each subseries using LSTM significantly improves the accuracy of the forecasts. In a similar study, ref. [5] blended WT and bidirectional LSTM (bi-LSTM). The proposed hybrid model effectively utilised historical and future information of the subseries to accurately forecast wind speed data.
Among the studies reviewed, none assessed the randomness/complexity of wavelet signals through information theory to determine the most suitable modelling and forecasting approach, which would improve models’ prediction accuracy. In fact, in the literature, wind speed forecasting models are seldom optimised and enhanced by coupling sample entropy (SampEn) and wavelet decomposition. Moreover, ref. [5] argued that ensemble or combination methods should be used to account for the nonlinear aspects of wind speed forecasts, since linear combination techniques, such as those used in [3,7,13,14], are insufficient. There is a significant scarcity of research that leverages the combination of the aforementioned approaches for reliable, accurate, and robust short-term wind speed predictions. The lack of information in this area has left a void in the knowledge base that needs to be addressed.
In this paper, we formulate a hybrid approach that combines the benefits of WT, SampEn, neural network autoregression (NNAR), LSTM, and gradient boosting machine (GBM) methods to form WT-NNAR-LSTM-GBM. In the proposed model, the WT is used to decompose wind speed data into less complex multiple components (approximate and detailed subseries), while SampEn determines the presence of complex features in each decomposed subseries. The NNAR and LSTM networks are used to independently predict the deterministic (low-variant) and complex (high-variant) subseries, respectively. The nonlinear, scalable, and highly accurate GBM is used to optimally reconcile the subseries forecasts into final forecasts. The efficacy and robustness of the proposed model, which to the best of our knowledge has not been explored in the wind speed forecasting literature, is tested against the naïve model and four other models, namely NNAR (benchmark), LSTM, and two hybrid models. One of these hybrid models replaces LSTM with a k-nearest neighbour (KNN) to form WT-NNAR-KNN-GBM, while the other replaces NNAR with KNN in WT-NNAR-LSTM-GBM to form WT-KNN-LSTM-GBM.
The current study addresses the shortcomings identified in the previously reviewed work by enhancing the accuracy of quantifying wind energy resources in the Southern Africa region within the short-term forecasting framework. Considering nonlinear, deterministic, and random facets of wind speed, the study proposes developing a robust, reliable, and comprehensive multi-model framework. In this way, the model will facilitate the seamless and reliable integration of significant volumes of wind energy into electrical grids.
This study uses minutely averaged wind speed data from the Council for Scientific and Industrial Research (CSIR) energy centre, GIZ Richtersveld (RVD), USAid Venda (Venda), and USAid Namibian University of Science and Technology (NUST) radiometric stations downloaded from the Southern African Universities Radiometric Network (SAURAN) website (http://www.sauran.ac.za) (accessed 12 December 2022). An R.M. Young (05103 or 03001) anemometer instrument was used to measure these high-resolution minute-based wind speed data. This instrument, which is durable, corrosion-resistant, and lightweight, has a four-blade helicoid propeller to accurately measure wind speed and a vane to determine wind direction.

1.3. Innovations and Contributions

Wind speed data are complex and volatile. A single model may not fully capture these behaviours, which, in the short term, disrupt real-time grid operations, uniform wind power distribution, and output optimisation. However, these behaviours can be effectively and accurately captured (to a certain degree) using hybrid models. It is necessary to unmask the complexities as well as the deterministic and random patterns of wind speed before delving into fitting appropriate models.
The literature shows that hybrid models are generally implemented by denoising and decomposing the original signal into subseries, modelling and forecasting the decomposed subseries, and then combining the subseries forecasts. However, this approach often disregards other fundamental and critical elements of the subseries. For instance, when using WTs, the complexity and variability of the subseries at lower levels of decomposition are distinct from those at higher levels of decomposition. Consequently, treating these subseries identically may compromise the accuracy of the final forecasts (see [2,3] for details). As such, it is pivotal that each subseries be treated independently, i.e., inspected and modelled relative to its inherent features; in turn, this improves prediction accuracy. Furthermore, the majority of hybrid models rely excessively on a linear combination of subseries forecasts to generate the final forecast (see, e.g., [4,6,14,21]). Despite being simple and efficient, this linear approach lacks accuracy and stability when combining wind speed subseries forecasts that are inherently nonlinear, leading to excessive error accumulation in the final forecast value (see [2,5] for details). It is vital to optimise the forecast combination with other nonlinear prediction methods at various stages of the prediction process, viz., model parameter optimisation and output error correction.
The above-mentioned issues are at the core of the innovation and novelty of the hybrid strategy proposed in this paper. Hence, the main contributions of this paper are as follows:
  • The proposed hybrid model for wind speed forecasting is predicated upon a multi-model ensemble approach, incorporating data decomposition through WTs, complexity classification through SampEn, individual subseries modelling and prediction using NNAR and LSTM techniques, and forecast combination through GBM strategies. This model presents a comprehensive solution for wind speed forecasting, leveraging the strengths of several techniques to provide a refined and accurate forecast.
  • WTs play a crucial role in data transformation and decomposition, as they offer exceptional efficiency while minimising random fluctuations in data sequences, thus improving models’ prediction accuracy. As such, these techniques are highly recommended for breaking down irregular wind speed data into low- and high-frequency signals. Significantly, the decomposed signals exhibit more apparent trends and patterns with less variability than the original wind speed signal. This allows for more efficient training and prediction, to a certain extent.
  • The concept of randomness pertains to the frequency of distinct digits appearing in a given sequence. To gauge randomness, statistical metrics such as the mean, standard deviation, skewness, kurtosis, etc., are often employed in the literature. Nevertheless, these traditional approaches fall short of addressing certain complexities that emerge when examining randomness this way (see [22] for details). By contrast, the SampEn criterion enables efficient and effective classification of the decomposed signals according to their complex and deterministic properties. Consequently, the most suitable modelling and forecasting approach is employed for each signal, improving prediction accuracy.
  • Besides identifying patterns, NNAR models are resistant to non-stationarity and outliers. These nonlinear approximators leverage the SampEn criterion to precisely predict less random and deterministic subseries.
  • To curb the gradient disappearance and explosion to which NNARs are susceptible, subseries classified as more complex or highly random by the SampEn criterion are modelled and predicted using more reliable, optimised, and robust stateless LSTMs. Unlike a stateful LSTM, a stateless LSTM can effectively and accurately learn patterns in unstable, random time series data such as wind speed. Furthermore, stateless LSTMs are preferred over stateful LSTMs for this time series prediction task because of their higher stability, simplicity, and accuracy.
  • To capture the complex nonlinear structure embedded in wind speed subseries forecasts, it is imperative to employ a nonlinear forecast combination method. Therefore, a highly scalable, robust, nonlinear GBM model is preferred over a linear combination model for combining nonlinear wind speed subseries forecasts.
The study has been conducted in a manner that is reliable and reproducible, and it has provided appropriate and comprehensive assessment metrics (statistical testing) that are suitable for short-term wind speed modelling and prediction.

1.4. Paper Structure

The remainder of this paper is organised as follows: an introduction to this study’s models and the pertinent data are presented in Section 2, along with some statistical tests. The results are discussed in Section 3. This study is concluded in Section 4.

2. Materials and Methods

2.1. Data Description

In this study, minutely averaged wind speed data were downloaded from radiometric stations at CSIR in Gauteng, RVD in the Northern Cape, Venda in the Limpopo province, and NUST in Namibia. The first three (3) (CSIR, RVD, and Venda) stations are based in South Africa whilst the fourth station (NUST) is located in Namibia. These four stations of interest are described in Table 1 and represent four different seasons of the year. An R.M. Young anemometer (05103 or 03001) was used to measure high-resolution minute-based wind speeds at the four stations of interest. Intentionally, four radiometric stations of interest were chosen to assess the efficacy of the proposed predictive methods in different seasons and at sites with varying meteorological patterns (see Figure 1).

2.2. Exploratory Analysis

In Table 2, the minute averages of wind speed data cover four months representing the four seasons of 2019. A total sample of 1440 observations (representing a full day) was drawn per station. The data are split into two sets per station: a training set (80%, 1152 observations) and a testing set (20%, 288 observations). Models are trained and built using the training data, while their performance is evaluated using the testing data.
Table 3 presents summary statistics for the minutely averaged wind speed data of the four (4) stations under investigation. The NUST and Venda data are leptokurtic (kurtosis greater than 3), whilst the CSIR and RVD data are platykurtic (kurtosis less than 3). The RVD data are more variant (higher standard deviation) than the other three datasets. Furthermore, the RVD data are negatively skewed, whilst the other three datasets are positively skewed. Additionally, time series plots, boxplots, quantile-to-quantile (Q-Q) plots, and density plots reveal that the data from all stations are non-normal and noisy.
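For reproducibility, the Table 3-style summary statistics and the chronological 80/20 split can be computed along the following lines; this is a minimal sketch assuming the "moments" package, where wind_speed is a placeholder for one station's minutely averaged series.

```r
# Minimal sketch: Table 3-style summary statistics and the 80/20 split.
# Assumes the "moments" package; "wind_speed" is a hypothetical vector
# holding one station's 1440 minutely averaged observations.
library(moments)

c(mean = mean(wind_speed), sd = sd(wind_speed),
  skewness = skewness(wind_speed),   # > 0: right-skewed
  kurtosis = kurtosis(wind_speed))   # > 3: leptokurtic; < 3: platykurtic

# Chronological 80/20 split (no reshuffling), as used throughout:
train <- wind_speed[1:1152]
test  <- wind_speed[1153:1440]
```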

2.3. Wavelet Transformation

In signal analysis, wavelets are one of the most widely used mathematical methods. When dealing with time series wind speed data that are high-variant and non-stationary, data pre-processing is critical for the application of an appropriate model [21,23]. In the literature, WT as a time–frequency decomposition technique has proven to be a very effective pre-processing method in time series forecasting (see, e.g., [10,14,16,20,21]). This is because WT can collect insightful information and simultaneously remove noise and irregular patterns from the original wind speed data [21].
The non-orthogonal maximal overlap discrete wavelet transform (MODWT) is a variant of the orthogonal discrete wavelet transform (DWT) [24,25]. The MODWT decomposes signals into approximate and detailed coefficients, yields asymptotically efficient wavelet variance estimates, and is redundant and well-defined for all sample sizes [24,25,26,27]. The DWT is time-variant, which leads to an inability to capture random delays; to overcome this defect, the MODWT's zero-phase-filter-associated coefficients are time-invariant, meaning that shifts in the time series signal do not affect the signal pattern [24,25,26,27]. Thus, the MODWT applies the $j$th-level low-pass filter coefficients $\tilde{g}_{j,l}$ and high-pass filter coefficients $\tilde{h}_{j,l}$ to decompose the wind speed series $Y_t = \{y_1, y_2, \ldots, y_N\}$ up to level $j$, such that the wavelet (detailed) and scaling coefficients are given by the following respective equations [25]:

$$\tilde{w}_{j,t} = \sum_{l=0}^{L_j - 1} \tilde{h}_{j,l}\, y_{(t-l) \bmod N}, \quad t \in [0, N-1], \tag{1}$$

$$\tilde{s}_{j,t} = \sum_{l=0}^{L_j - 1} \tilde{g}_{j,l}\, y_{(t-l) \bmod N}, \quad t \in [0, N-1], \tag{2}$$

where both filters have width $L_j = (2^j - 1)(L - 1) + 1$, with $L$ being the width of the filter when $j = 1$. The two filters are quadrature mirrors. Without loss of generality, a decomposed wind speed signal can be reconstructed by linearly combining the smooth (approximate) and detailed coefficients as follows:

$$Y_t = \sum_{k=1}^{J} D_k + A_J, \tag{3}$$

where $D_k = \sum_{l=0}^{N-1} \tilde{h}^{p}_{k,l}\, \tilde{w}_{k,(t+l) \bmod N}$ and $A_J = \sum_{l=0}^{N-1} \tilde{g}^{p}_{J,l}\, \tilde{s}_{J,(t+l) \bmod N}$, with $\tilde{h}^{p}_{j,l}$ and $\tilde{g}^{p}_{j,l}$ being $\tilde{h}_{j,l}$ and $\tilde{g}_{j,l}$ periodised to length $N$. The detailed coefficients (high-frequency signals) are computed for each level, whilst the smooth coefficients (low-frequency signals) are calculated for the maximum level $J = \mathrm{int}(\log_2(N))$.
Through the "modwt" function in R, we decomposed the four datasets at level $J = 3$ into an approximate signal (A3) and three detailed signals (D1, D2, and D3) to reveal the underlying breaks, discontinuities, and trends in the time series data. Upon decomposition, it was found that the variance decreases as the level of decomposition increases. Thus, subseries at lower levels of decomposition (D1 and D2) have higher complexity than subseries at higher levels of decomposition (D3 and A3) (see Figure 2).
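A decomposition along these lines can be sketched as follows; this is a minimal illustration assuming the "waveslim" package (the text names only a "modwt" function, so the package and the "la8" filter are assumptions), using the multiresolution analysis so that the subseries sum back to the original signal.

```r
# Minimal sketch of the 3-level MODWT-based decomposition.
# Assumes the "waveslim" package and an illustrative "la8" filter;
# "wind_speed" is a hypothetical station series.
library(waveslim)

y   <- as.numeric(wind_speed)
dec <- mra(y, wf = "la8", J = 3, method = "modwt", boundary = "periodic")

D1 <- as.numeric(dec$D1)   # detailed (high-frequency) signals
D2 <- as.numeric(dec$D2)
D3 <- as.numeric(dec$D3)
A3 <- as.numeric(dec$S3)   # approximate (low-frequency) signal

# Sanity check: the MRA components reconstruct the original signal.
max(abs(y - (D1 + D2 + D3 + A3)))   # ~0 up to numerical error
```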

2.4. Sample Entropy

SampEn is a modified version of approximate entropy (ApEn) and is commonly used for physiological time series signals. The rationale behind SampEn is to estimate the randomness or complexity of a data series without any prior knowledge of the source generating the dataset [22,28]. Unlike ApEn, SampEn does not include self-matches, nor does it depend strongly on the length of the time series [22,28]. For a time series of length $n$, $SampEn(m, r, n)$ is defined as the negative logarithm of the conditional probability that two sequences that are similar for $m$ points [29] remain similar at the next point, within a tolerance $r$ and excluding self-matches [30] (see [28] for a detailed discussion). That is, a vector that has matched another at $m$ points is assumed to continue to do so at $m + 1$ points, and the conditional probability is determined from the counts of matches at $m + 1$ points relative to matches at $m$ points [22]. The SampEn statistic is calculated as follows [31]: consider two sequences of $m$ data points denoted by $Y_m(i) = \{y_i, y_{i+1}, \ldots, y_{i+m-1}\}$ and $Y_m(j) = \{y_j, y_{j+1}, \ldots, y_{j+m-1}\}$ $(i, j \in [1, N - m],\ i \neq j)$, extracted from a vector with constant time interval $Y_t = \{y_1, y_2, \ldots, y_N\}$; their maximum distance is compared with the tolerance $r$ when counting repeated sequences, according to the following Chebyshev distance criterion (a Euclidean distance function could be used as well) (also see [22,28,30]):

$$d[Y_m(i), Y_m(j)] = \max\{|y_{i+k} - y_{j+k}|\} \leq r, \tag{4}$$

where $k \in [0, m-1]$ and $r \approx 0.2\,\delta_{Y_t} > 0$, with $\delta_{Y_t}$ being the standard deviation of $Y_t$, and $\max\{|\cdot|\}$ denotes the Chebyshev norm such that $\max\{|y_{i+k} - y_{j+k}|\} = \max_{0 \leq k \leq m-1} |y_{i+k} - y_{j+k}|$. Then,

$$SampEn(m, r, n) = -\ln\!\left(\frac{A}{B}\right), \tag{5}$$

where $A$ and $B$ denote the numbers of template vector pairs satisfying $d[Y_{m+1}(i), Y_{m+1}(j)] < r$ and $d[Y_m(i), Y_m(j)] < r$, respectively. Data with high SampEn have low probabilities of repeated sequences, resulting in lower regularity and greater complexity [22]. In other words, large values indicate that a time series signal is complex (or highly random), whilst small SampEn values indicate a deterministic time series (less randomness). In addition to indicating less noise, low SampEn values are signs of high self-similarity. Entropy typically takes values between 0 and 1, but it can exceed 1 depending on the number of classes in the dataset; in this case, the interpretation of entropy remains unchanged. Values greater than 1 correspond to an erratic series, with irregular variations, abrupt spikes, and turbulent behaviour [32]. SampEn is highly dependent on parameter selection, especially for small datasets (i.e., $n \leq 200$).
To measure the complexity (or randomness) of each decomposed subseries in the R program, the “sample_entropy” function within the “pracma” library was used. For all subseries, we used an embedding dimension of m = 2 to calculate sample entropy values (also see [28]). The corresponding r values can be found in Table 4.
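A sketch of this screening step is given below, assuming the decomposed subseries D1-D3 and A3 from the earlier snippet; pracma::sample_entropy is called with the embedding dimension m = 2, and the conventional r = 0.2 × standard deviation stands in for the station-specific r values of Table 4.

```r
# Sketch of the SampEn-based complexity screening of the subseries.
# Assumes the "pracma" package and the D1, D2, D3, A3 vectors computed
# in the decomposition sketch above; r = 0.2 * sd(s) is illustrative.
library(pracma)

complexity <- sapply(list(D1 = D1, D2 = D2, D3 = D3, A3 = A3),
                     function(s) sample_entropy(s, edim = 2, r = 0.2 * sd(s)))

# Routing rule used in this study: SampEn >= 0.9 -> complex (LSTM);
# SampEn < 0.9 -> deterministic (NNAR).
ifelse(complexity >= 0.9, "LSTM", "NNAR")
```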
Table 5 shows the SampEn statistics for the four (4) datasets. For all four (4) datasets, signals D1 and D2 are more variable (to a great extent erratic), as their SampEn values are close to or greater than 1, i.e., above the selected threshold of 0.9 (also see Figure 2). Thus, these signals have more complex time series features. In contrast, D3 and A3 have small SampEn values typical of conventional time series signals.

2.5. Neural Network Autoregression

In FFNNs, information only flows in a forward direction [15,33]. The two most common FFNNs are single-layer perceptron and multi-layer perceptron. A single-layer perceptron with only input and output layers is denoted by the following equation [15]:
$$y_t = f\!\left(\sum_{i=0}^{v} x_i \cdot w_i + b\right), \tag{6}$$

where $x_i$ $(i = 0, 1, \ldots, v)$ is the input, $y_t$ is the targeted output, $w_i$ $(i = 0, 1, \ldots, v)$ denotes the weight connecting neurons, $v$ is the number of features, $f$ is an activation function, and $b$ is the bias term. Different from a multi-layer perceptron, a single-layer perceptron is unable to effectively learn complex patterns. Hence, single-layer perceptrons are not frequently employed in practical applications.
A multi-layer perceptron is composed of multiple (at least three) layers of computational units linked together in a feed-forward manner. A multi-layer perceptron with one input layer, one hidden layer, and one output layer, is given by [15]:
$$y_t = g\!\left(\sum_{j=0}^{z} w^{(2)}_{tj}\, f\!\left(\sum_{i=0}^{v} x_i \cdot w^{(1)}_{ij} + b^{(1)}_j\right) + b^{(2)}_t\right), \tag{7}$$

where $v$ is the number of input neurons and $z$ is the number of neurons in the hidden layer, $f$ is the activation function of the hidden layer, $g$ is the activation function of the output layer, $w^{(1)}_{ij}$ is the weight from the $i$th input to the $j$th hidden neuron, $b^{(1)}_j$ is the bias for the $j$th hidden neuron, $w^{(2)}_{tj}$ denotes the weight from the $j$th hidden neuron to the $t$th output neuron, and $b^{(2)}_t$ is the bias for the $t$th output neuron. In contrast to single-layer perceptrons, the output of the inner layer in the multi-layer network is multiplied by a new weight vector and then passed through an activation function.
The NNAR model is a type of FFNN that is applied iteratively; it has three layers (one input layer, one hidden layer, and one output layer) and a sigmoid activation function, which minimises the overall impact of extreme values on the predicted final output [33] (see Figure 3). In essence, the NNAR is a multi-layer perceptron network. The NNAR model uses lagged values of the series as inputs to the neural network for forecasting, and it imposes no restrictions on the model parameters to ensure stationarity [33]. When applied to non-seasonal time series data, the model is denoted by NNAR($p$, $k$), where $p$ is the autoregressive order, $\{y_{t-1}, y_{t-2}, \ldots, y_{t-p}\}$ are the lagged inputs, and $k$ is the number of neurons in the hidden layer. To deal with seasonal time series data, the NNAR model includes an additional seasonal autoregressive (SAR) process of order $P$ and is denoted by NNAR($p$, $P$, $k$)$_s$, where $\{y_{t-1}, y_{t-2}, \ldots, y_{t-p}, y_{t-s}, y_{t-2s}, \ldots, y_{t-Ps}\}$ are the lagged inputs, and $k$ is the number of neurons in the hidden layer, with seasonality at multiples of $s$.
In the NNAR, random weights are initially assigned and then updated using the observed data. Furthermore, the NNAR model leverages historical observations to produce step-ahead forecasts (also see Figure 3). In the case of two-step-ahead predictions, the model feeds both historical data and the one-step-ahead forecast back into its network [33]. The NNAR model with inputs $\{y_{t-1}, y_{t-2}, \ldots, y_{t-p}\}$ and output $y_t$ is given by the following equation:

$$y_t = w_0 + \sum_{j=1}^{z} w_j \cdot f\!\left(w_{0j} + \sum_{i=1}^{v} y_{t-i} \cdot w_{ij}\right) + b, \tag{8}$$

where $z$ is the number of hidden nodes, $v$ is the number of input (lagged) nodes, $w_{ij}$ $(i = 0, 1, \ldots, v;\ j = 1, \ldots, z)$ are the input-to-hidden weights, $w_j$ $(j = 1, \ldots, z)$ are the hidden-to-output weights, and $f$ is typically a sigmoid activation function.
Since the original wind speed data and the corresponding subseries are not all normally distributed (see Table 3 and Table 6), the Box–Cox transformation, given by

$$y_t^{(\lambda)} = \begin{cases} \dfrac{y_t^{\lambda} - 1}{\lambda}, & \text{if } \lambda \neq 0, \\[4pt] \ln(y_t), & \text{if } \lambda = 0, \end{cases} \tag{9}$$

where $\ln(\cdot)$ denotes the natural logarithm, was used to normalise the data and stabilise variances, thereby improving the prediction performance of the NNAR model.
The parameters that were used to fit the NNAR model on the decomposed subseries and original wind speed datasets were automatically selected using the “nnetar” function in the “forecast” R package and are presented in Table 7, below.
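A sketch of this fitting step: assuming the "forecast" package, nnetar selects (p, k) automatically, and a Box–Cox lambda estimated on the training split is passed through; the subseries name A3 is carried over from the earlier snippets.

```r
# Sketch of fitting an NNAR model to one deterministic subseries (A3)
# with automatic (p, k) selection and a Box-Cox transformation.
# Assumes the "forecast" package and that the subseries is positive
# (the Box-Cox transformation requires positive values).
library(forecast)

train <- ts(A3[1:1152])               # 80% training portion
lam   <- BoxCox.lambda(train)         # estimate the Box-Cox lambda

fit <- nnetar(train, lambda = lam)    # auto-selected NNAR(p, k)
fit                                   # prints the chosen NNAR(p, k)

fc <- forecast(fit, h = 288)          # predict the 20% test horizon
fc$mean                               # point predictions
```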

2.6. Long Short-Term Memory Networks

Hochreiter and Schmidhuber [34] originally developed the LSTM to capture long-term dependencies through targeted design. The LSTM is a special and improved type of RNN [5,12,13,15,16]. Contrary to the standard RNN, this deep learning algorithm can detect and solve the problems of short- and long-term dependence, gradient disappearance, and gradient explosion through the concept of memory units, thereby enhancing the forecasting accuracy of the model [6,9,16]. Thus, the LSTM provides effective reserve optimisation and high learning efficiency when forecasting time series data. LSTM units utilise their recurrent structure and gate mechanisms to retain information within batches [34]; enabling continuity across batches requires the activation of the stateful LSTM feature. The standard LSTM unit comprises an input node ($g_t$), input gate ($i_t$), output gate ($o_t$), and forget gate ($f_t$) (see [5,12,13,15,16,17,18] for details) (also see Figure 4).
The forget gate removes unnecessary information from the memory cell state by multiplying the inputs $y_t$ (current input) and $h_{t-1}$ (former cell output) by weight matrices before a bias is added. The resulting value is passed through an activation function whose output lies between 0 (lost) and 1 (stored for future use). The input gate adds important information to the memory cell state by regulating it with an activation function and filtering the values to be remembered. The tanh function is employed to generate a vector of candidate values from $h_{t-1}$ and $y_t$; this vector and the regulated values are multiplied to obtain meaningful data. The output gate extracts information from the current cell state by creating a vector using the tanh function applied to the inputs $[h_{t-1}, y_t]$; the information is regulated using the sigmoid function and filtered by the values to be remembered. Finally, the vector and the regulated values are multiplied to yield the output and the input to the next cell (also see Figure 4) (see [5,12,13,15,16,17,18] for details). The standard LSTM architecture is anchored on the following mathematical equations, as given in [5,13,17]:
$$f_t = \sigma(w_f [h_{t-1}, y_t] + b_f), \quad f_t \in [0, 1], \tag{10}$$

$$i_t = \sigma(w_i [h_{t-1}, y_t] + b_i), \quad i_t \in [0, 1], \tag{11}$$

$$g_t = \varphi(w_g [h_{t-1}, y_t] + b_g), \quad g_t \in [-1, 1], \tag{12}$$

$$o_t = \sigma(w_o [h_{t-1}, y_t] + b_o), \quad o_t \in [0, 1], \tag{13}$$

$$s_t = s_{t-1} \odot f_t + g_t \odot i_t, \tag{14}$$

$$h_t = \varphi(s_t) \odot o_t, \tag{15}$$

where $w_f$, $w_i$, $w_g$, and $w_o$ are the weight matrices applied to the input signal $[h_{t-1}, y_t]$, in which $h_{t-1}$ is the previous cell output and $y_t$ denotes the input vector; $b_f$, $b_i$, $b_g$, and $b_o$ are bias vectors. The operator $\odot$ denotes element-wise multiplication, $\sigma(\cdot)$ is the logistic sigmoid activation function, $\varphi(\cdot)$ is the hyperbolic tangent function, and $s_t$ is the memory cell state that recalls historical information within and across batches. The $\sigma(\cdot)$ and $\varphi(\cdot)$ functions are, respectively, given by:

$$\sigma(y_t) = \frac{1}{1 + e^{-y_t}} \in [0, 1], \tag{16}$$

and

$$\varphi(y_t) = \frac{e^{y_t} - e^{-y_t}}{e^{y_t} + e^{-y_t}} \in [-1, 1]. \tag{17}$$
These gates determine the addition, removal, and output of information from the memory cell. Memory cells explain dependencies within and across batches [6,12,13]. An input gate controls the extent to which new information flows into the memory cell, whilst an output gate controls how much information is utilised to calculate the output of the LSTM. The forget gate controls how much information remains in the memory cell [12,13,17,18]. As a result, LSTM effectively combats vanishing gradients, an issue prevalent in conventional ANNs and RNNs.
Sequence modelling with LSTMs can be classified as either stateful or stateless, depending on the training configuration applied [35,36]. A stateful LSTM preserves information across batches, which is crucial for capturing long-term dependencies in the input data. Stateless LSTMs, in essence, are standard LSTMs whose internal state is reset after each sequence or batch, as discussed at the beginning of this section (also see Figure 4); the LSTM unit thus processes each input sequence independently [35,36]. It should be noted that statelessness applies between sequences, not within batches or sequences, so that dependencies and useful information within a sequence are still effectively remembered and captured [36]. This is because the memory cell remains fully functional and effective within each sequence.
Due to their computational intensity, stateful LSTMs may not be the best choice for executing the short-term relation prediction task on highly random data. In fact, authors of [36] established that the generalisation that stateful LSTM performs better than stateless LSTM on sequential data was (to an extent) not applicable to wind speed data. The authors further articulated that the external factors (mostly due to weather fluctuations) were more influential on wind speed than previous observations, thereby hampering the effectiveness and usefulness of the stateful LSTM in this regard.
Contrary to the stateful LSTM, the stateless LSTM performed better when dealing with highly complex and sporadic wind speed data [36]. According to [35], stateless LSTMs are preferred over stateful LSTMs for time series prediction tasks because of their higher stability. The authors of [35] further advocated for a single hidden layer when employing stateless LSTM to time series problems to yield better results. In fact, stateless LSTM algorithms are advantageous in that they are computationally efficient, accurate, and comprehensible [35,36]. Due to the inherent capability of stateless LSTMs to handle wind speeds that are strongly influenced by climatic conditions, this study uses them to forecast highly irregular and variant wind speed time series.

Stateless LSTM Prediction Approach

A thorough discussion of the application of stateless LSTM models is provided in the work of [35,36]. In this study, the LSTM network is set up using the "keras" package in R (an interface to the Python Keras library). Unlike the NNAR, the LSTM is sensitive to the data structure; hence, the following steps were undertaken when implementing the LSTM network (a compact code sketch follows the steps below):
  • Step 1: Data preparation
By lagging, each time series signal is transformed into a supervised learning form, in which $y_t = \psi(y_{t-1})$. The rationale is to train the model to predict $y_t$ from the previous value $y_{t-1}$ with the lowest possible error. This also puts the data into a form the LSTM network can consume.
  • Step 2: Data normalisation
The input data are min-max rescaled so that they are compatible with the bounded (hyperbolic tangent) activation of the LSTM network, using the equation below:

$$y'_t = \frac{y_t - y_{\min}}{y_{\max} - y_{\min}}, \tag{18}$$

where $y_t$ is the wind speed series value at time $t$, and $y_{\min}$ and $y_{\max}$, respectively, denote the minimum and maximum of the wind speed series in the training dataset. To ensure that the scaling does not distort the performance assessment, the scaling factors computed from the training dataset are used to scale the training data, the testing data, and the predicted values. The benefit of this limited range is a smaller standard deviation, which dampens the effect of outliers, thereby improving the robustness and predictive strength of the LSTM models.
  • Step 3: Training LSTM Network
As part of the process of building an LSTM network, the input batch requires a three-dimensional (3D) shape (i.e., samples, timesteps, features). Data training is conducted with the internal states of the LSTM network deactivated (by setting stateful to “FALSE”) since the intention is to learn patterns and trends within the series. As a result, more accurate and stable predictions were generated (see [35,36] for more details). A single hidden layer was used during data training (see Table 8).
The mean squared error (MSE) was used as the loss function for assessing model performance, while adaptive moment estimation (ADAM) with a linear decay rate (very close to zero) and the learning rates shown in Table 8 was used to optimise the algorithm. ADAM offers high computational efficiency and low memory requirements, and it is suitable for optimising problems with many parameters (see [18] for more details). To preserve the ordering between time $t-1$ and time $t$, the "shuffle" argument of the LSTM network was deactivated when fitting (training) the model; the stateless LSTM was found to perform better without shuffling.
  • Step 4: Predictions Using the LSTM Network
After the LSTM model has been fully trained, predictions for the test data are generated on the normalised scale. The normalised predictions $\hat{y}'_t$ are then denormalised to revert to the original scale using the following equation:

$$\hat{y}_t = \hat{y}'_t\,(y_{\max} - y_{\min}) + y_{\min}, \tag{19}$$

where $\hat{y}_t$ and $\hat{y}'_t$, respectively, denote the wind speed series prediction and the normalised wind speed series prediction at time $t$.
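The four steps can be sketched as follows, assuming the "keras" R package with a working TensorFlow backend; the subseries D1 is carried over from the earlier snippets, and the layer size, learning rate, epochs, and batch size are placeholders rather than the tuned values of Table 8.

```r
# Compact sketch of Steps 1-4 for one complex subseries (D1).
library(keras)

y    <- as.numeric(D1)
n_tr <- 1152                                   # 80% training portion

# Step 2: min-max scaling with factors from the training portion only.
y_min <- min(y[1:n_tr]); y_max <- max(y[1:n_tr])
yn    <- (y - y_min) / (y_max - y_min)

# Step 1: lag-1 supervised pairs (predict y_t from y_{t-1}).
x_all <- yn[-length(yn)]; y_all <- yn[-1]

# Step 3: stateless LSTM; inputs must be 3D (samples, timesteps, features).
x_tr  <- array(x_all[1:(n_tr - 1)], dim = c(n_tr - 1, 1, 1))
model <- keras_model_sequential() %>%
  layer_lstm(units = 32, input_shape = c(1, 1), stateful = FALSE) %>%
  layer_dense(units = 1)
model %>% compile(loss = "mse",
                  optimizer = optimizer_adam(learning_rate = 0.001))
model %>% fit(x_tr, y_all[1:(n_tr - 1)], epochs = 50, batch_size = 32,
              shuffle = FALSE)                 # preserve temporal order

# Step 4: predict the test portion and denormalise via Equation (19).
idx   <- n_tr:length(x_all)
x_te  <- array(x_all[idx], dim = c(length(idx), 1, 1))
y_hat <- as.numeric(model %>% predict(x_te)) * (y_max - y_min) + y_min
```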

2.7. Gradient Boosting Machines

Gradient boosting machines (GBM) are common machine learning algorithms that build an ensemble of shallow and weak successive trees, with each tree learning and improving on the previous. The boosting algorithm is built on three components: weak learners, an additive model, and a loss function [37,38]. This technique fits an additive model, stage-wise. The gradient boosting tree is mathematically defined by the following equation:
$$F_n(y_t) = \sum_{i=1}^{n} \eta_i(y_t), \tag{20}$$

where $\eta_i(y_t)$ is a decision tree and $n$ is the number of trees. An ensemble of trees is built sequentially by estimating a new decision tree $\eta_{n+1}(y_t)$ via

$$\arg\min_{\eta_{n+1}} \sum_{t} L\big(y_t,\, F_n(y_t) + \eta_{n+1}(y_t)\big), \tag{21}$$

where $L(\cdot)$ is a differentiable loss function. Steepest gradient descent is used to solve Equation (21). Despite the GBM's flexibility and high forecasting accuracy, the algorithm is data-greedy and easily overfits the training dataset. The GBM is given by (see [37]):

$$F(y) = \sum_{k=1}^{K} \beta_k\, \theta(y; \gamma_k), \tag{22}$$

where $\theta(y; \gamma_k) \in \mathbb{R}$ are functions of $y$ characterised by the expansion parameters $\beta_k$ and $\gamma_k$, which are fitted stage-wise to delay over-fitting of the model.

Hyperparameter Setting for GBM

When tuning the GBM algorithm, the following three parameters are of paramount importance: (a) the number of trees: increasing the number of trees reduces the training dataset error, but too many trees can result in overfitting; (b) interaction depth (or tree depth): this is the number of splits in each tree, and it controls the difficulty of the boosted ensemble; and (c) learning rate: smaller values reduce the risk of overfitting, but also increase the time required to find the optimal fit.
In a stepwise manner, a grid search was conducted on the 80% training data (from the original wind speed data) to find the best hyperparameter combinations for each of the four stations. The resultant optimal parameters, which were employed to optimally reconcile forecasts for each subseries, are presented in Table 9 below. It should be noted that high-variant datasets (RVD and CSIR) required more trees than low-variant datasets (Venda and NUST).
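The grid search can be sketched as follows; this is a minimal illustration assuming the "gbm" package, with an illustrative grid and a hypothetical data frame train_df holding the subseries forecasts as predictors of the observed wind speed.

```r
# Sketch of a stepwise grid search over the three key GBM hyperparameters.
# Assumes the "gbm" package; grid values and "train_df" are illustrative.
library(gbm)

grid <- expand.grid(n.trees = c(100, 300, 500),
                    interaction.depth = c(1, 3, 5),
                    shrinkage = c(0.01, 0.1))
grid$rmse <- NA_real_

for (i in seq_len(nrow(grid))) {
  set.seed(1)                                   # reproducible CV folds
  fit <- gbm(wind_speed ~ ., data = train_df, distribution = "gaussian",
             n.trees = grid$n.trees[i],
             interaction.depth = grid$interaction.depth[i],
             shrinkage = grid$shrinkage[i],
             cv.folds = 5, verbose = FALSE)
  grid$rmse[i] <- sqrt(min(fit$cv.error))       # CV-RMSE at the best iteration
}

grid[which.min(grid$rmse), ]                    # best hyperparameter combination
```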

2.8. K-Nearest Neighbours

KNN is a non-parametric machine learning method widely used for classification and regression problems [38,39]. The KNN algorithm is robust and effective when dealing with small training datasets [39,40]. This simple algorithm can be described in three main steps [38]: (a) calculate the distance between the training and testing data; (b) select the nearest neighbours with the smallest distances from the training set; and (c) forecast the time series value based on a weighted approach. In this algorithm, the training data are arranged in a space defined by the selected features. To identify the class of a test case, the algorithm compares the classes of the $k$ closest data points. When $k$ is small, noise becomes a bigger factor, while a large value results in computational inefficiency. In this study, values of $k \in [10, 34]$ (determined by trial and error) were used to minimise the RMSE. Furthermore, the multiplicative transformation strategy from the "tsfknn" R package was implemented to generate weighted point estimates [40]. To determine whether two cases are similar, Euclidean or Manhattan distance metrics are typically employed. The Euclidean distance employed in this study, defined between a new instance ($q_i$) and the $j$th training instance ($f_{ij}$), is given by:

$$ED = \sqrt{\sum_{i=1}^{n} (f_{ij} - q_i)^2}, \tag{23}$$

where $f_{ij}$ is the $j$th training instance from a vector $\tilde{V} = (f_{1j}, f_{2j}, \ldots, f_{nj})$ of length $n$, and $q_i$ denotes a new instance from the vector of new instances $\tilde{Q} = (q_1, q_2, \ldots, q_l)$ whose target is unknown but whose features are known. KNN techniques, which classify based on similarity, face challenges in handling high dimensionality, noisy data, outliers, and correlated features. To improve the output estimates, these challenges are often mitigated by dimensionality reduction techniques.
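A usage sketch, assuming the "tsfknn" package; k = 10 and the lag set are illustrative (the study tuned k in roughly [10, 34] by trial and error), with the weighted combination and the multiplicative transformation named in the text.

```r
# Sketch of KNN time series forecasting with tsfknn.
# "wind_speed" is a hypothetical station series; k and lags are illustrative.
library(tsfknn)

train <- ts(wind_speed[1:1152])
pred  <- knn_forecasting(train, h = 288, lags = 1:5, k = 10,
                         msas = "MIMO", cf = "weighted",
                         transform = "multiplicative")
pred$prediction                        # 288 point forecasts
```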

2.9. Proposed Predictive Approach

WT-NNAR-LSTM-GBM Model

Our proposed model for short-term wind speed prediction combines different methods: SampEn, WT, LSTM, NNAR, and GBM. This model, named WT-NNAR-LSTM-GBM, achieves high precision and computational simplicity. In Table 10, we provide a breakdown of each model’s contribution towards the overall performance of the WT-NNAR-LSTM-GBM model.
The WT-NNAR-LSTM-GBM model is implemented through Algorithm 1 (also see Figure 5 for the process flow chart):
Algorithm 1: WT-NNAR-LSTM-GBM
INPUT: Wind speed data [ Y t ]
1. Data preparation
The wind speed data are checked for invalid entries and missing values, mainly due to collection system malfunction and adverse environmental conditions. Wind speeds over 22 m/s are considered outliers and removed (if found), as they are not practical for wind power generation.
2. Data preprocessing
The original wind speed data are decomposed using a 3-level MODWT, resulting in three detailed signals (D1, D2, and D3) and one low-frequency approximate signal (A3).
3. Data complexity assessment
SampEn is used to estimate the level of randomness or complexity of each of the four decomposed subseries. Subseries with a SampEn value greater than or equal to 0.9 are considered complex (high-variant), while those with a SampEn value of less than 0.9 are considered deterministic (less variant or random).
4. Data formatting
For the LSTM, complex subseries are normalised through the min-max technique before they are divided into train and test sets so that they fall within a bounded range. For the NNAR, the Box–Cox transformation is used to reduce non-normality in the datasets and to ensure roughly homoscedastic residuals.
5. Data partition
The data are split into a training set (80%, used to build the model) and a testing set (20%, used to validate the strength and robustness of the model). Note that the split preserves the temporal structure of the subseries (no reshuffling), as reshuffling would impact model performance.
6. Parameter identification and model building
After splitting the data, the parameters for building an effective and efficient LSTM network are determined. These include batch size, time step, features, neural network layers, network dimensionality, learning rate, activation function, return type, network state, number of epochs, and error function. The "nnetar" function in the "forecast" package is employed to automatically select the optimal parameters (p, k) for the NNAR model (see [33] for details).
7. Model training
Using the training set (80%) as input, the prediction error of the LSTM network is computed and the model's performance is evaluated. Thereafter, the model parameters are fine-tuned according to the results to further improve the predictive power of the model.
8. Model testing (predictions and evaluation)
Both NNAR and LSTM use the best-fitting model to generate predictions, which are compared with the corresponding subseries test set (20%) using the performance metrics (RMSE and mean absolute error (MAE)). Note that the predictions from the LSTM network are first denormalised to restore the original scale of the subseries before the comparison is performed.
9. Prediction ensemble
i. Training the GBM: Before the GBM is employed to combine the subseries predictions, the model hyperparameters (number of regression trees, interaction depth, learning rate, and error function) are identified and used to train the model on 80% of the original wind speed data. The prediction error (RMSE) is computed, the parameter settings are fine-tuned, and the prediction strength of the model is improved.
ii. Forecast combination: The trained GBM model is used to compute the final predicted value by combining the predictions of all subseries. Finally, the hybrid predictions are compared with the remaining 20% of the original wind speed data using the performance metrics (MAE, RMSE, median absolute deviation (MAD), continuous ranked probability score (CRPS), and prediction interval width (PIW)).
OUTPUT: y ^ t , MAE, RMSE, MAD, CRPS, and PIW
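Algorithm 1 can be tied together in a high-level sketch as follows; all names follow the earlier (partly hypothetical) snippets, and predict_lstm() stands for a hypothetical wrapper around the keras sketch of Section 2.6.

```r
# High-level sketch of Algorithm 1 for one station (WT-NNAR-LSTM-GBM).
# Builds on the earlier snippets; "predict_lstm" and "gbm_fit" are
# hypothetical stand-ins for the LSTM wrapper and the tuned GBM.
library(waveslim); library(pracma); library(forecast)

y    <- as.numeric(wind_speed)                        # Step 1: cleaned series
dec  <- mra(y, wf = "la8", J = 3, method = "modwt")   # Step 2: MODWT
subs <- list(D1 = dec$D1, D2 = dec$D2, D3 = dec$D3, A3 = dec$S3)

preds <- lapply(subs, function(s) {                   # Steps 3-8
  s  <- as.numeric(s)
  se <- sample_entropy(s, edim = 2, r = 0.2 * sd(s))  # Step 3: complexity
  if (se >= 0.9) {
    predict_lstm(s, h = 288)                          # complex -> stateless LSTM
  } else {
    as.numeric(forecast(nnetar(ts(s[1:1152])), h = 288)$mean)  # NNAR
  }
})

# Step 9: nonlinear reconciliation of the subseries forecasts via GBM,
# previously trained on the 80% split (see the hyperparameter sketch above).
y_hat <- predict(gbm_fit, newdata = as.data.frame(preds),
                 n.trees = gbm_fit$n.trees)
```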
Table 11 summarises the process followed to build the hybrid models. WT-NNAR-KNN-GBM and WT-KNN-LSTM-GBM were derived using a similar approach to WT-NNAR-LSTM-GBM.

2.10. Predictive Performance Assessment

2.10.1. Point Prediction Metrics

One of the primary objectives of forecasting research is to develop an accurate predictive algorithm. Hence, an effective predictive algorithm should be selected using an appropriate error score. An error score or indicator should be able to distinguish or differentiate between a stable predictive model and an unstable predictive model [41]. This study evaluates the predictive accuracy of the proposed predictive model using MAE and RMSE.
$$MAE = \frac{1}{n} \sum_{t=1}^{n} |y_t - \hat{y}_t|, \tag{24}$$

$$RMSE = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)^2}, \tag{25}$$

where $y_t$ and $\hat{y}_t$, respectively, represent the actual and predicted wind speed values, and $n$ is the sample size. Smaller values of MAE and RMSE are preferred. These are the most common error indicators. RMSE and MAE are scale-dependent and are based on absolute errors [33]. The MAE indicator changes linearly, whilst RMSE penalises larger errors more than smaller ones [33,41]. RMSE is highly sensitive to distribution bias.
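Both scores are one-liners in R; the sketch below assumes y_test and y_hat vectors as produced in the earlier snippets.

```r
# Direct implementations of Equations (24) and (25).
mae  <- function(y, y_hat) mean(abs(y - y_hat))
rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))

mae(y_test, y_hat)    # smaller is better
rmse(y_test, y_hat)   # penalises large errors more heavily
```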

2.10.2. Probabilistic Prediction Metrics

A robust probabilistic system should provide sharpness and reliability [41]. The aspect of sharpness involves the model’s ability to generate predictions within a narrow probability distribution [41,42]. To measure the sharpness of the model, we employ the MAD given by the following equation:
$$MAD = 1.4826 \times \mathrm{median}(|y_t - \tilde{y}_t|), \tag{26}$$

where $y_t$ and $\tilde{y}_t = \mathrm{median}(y_t)$, respectively, denote the actual values and their median at time $t$. The MAD is advantageous in that it is robust, as it resists (or is less sensitive to) outliers. However, if the distribution is normal, the method loses efficiency by not utilising all the information available in the data. The smaller the value of the MAD, the better the model.
Model reliability or calibration refers to its ability to detect uncertainty when predicting [41,42,43]. Thus, the predictive distribution must accurately model the spread of the target variable [41,42]. Besides minimising absolute error, the CRPS employed in this study can compare deterministic and probabilistic predictions. The CRPS between x and F is defined as (see [44])
$$CRPS(F, x) = \int_{-\infty}^{+\infty} \big(F(y) - H(y - x)\big)^2\, dy, \tag{27}$$

where $F$ is the cumulative distribution function (CDF) of the random variable $X$, and $H$ is the Heaviside step function, with $H(y - x)$ equal to 1 if $y \geq x$ and 0 if $y < x$. A CRPS closer to 0 indicates reliable and accurate predictions, while a CRPS closer to 1 indicates inaccurate predictions. Explicit solutions for the CRPS are available only for limited parametric cases, which is a problem in practice. Examining the improper integral defined in Equation (27), the integrand vanishes in both tails: as $y \to -\infty$, $F(y) \to 0$ and $H(y - x) = 0$, while as $y \to +\infty$, $F(y) \to 1$ and $H(y - x) = 1$ (for a valid CDF, i.e., non-decreasing with $0 \leq F(y) \leq 1$). Hence the integrand in Equation (27) approaches 0 as $y \to \pm\infty$, and the integral converges to a finite number. A simple way of determining the CRPS involves splitting the improper integral in Equation (27) into two separate integrals such that (also see [42,43,44] for details)

$$CRPS(F, x) = \int_{-\infty}^{x} F(y)^2\, dy + \int_{x}^{+\infty} \big(F(y) - 1\big)^2\, dy. \tag{28}$$

In practice, the integrals in Equation (28) can be approximated by finite sums over a discrete grid $y_0 < y_1 < \cdots < y_n$ with spacing $\Delta y$, such that (also see [41,42,43])

$$CRPS(F, x) \approx \sum_{y_l \leq x} F(y_l)^2\, \Delta y + \sum_{y_l > x} \big(F(y_l) - 1\big)^2\, \Delta y, \tag{29}$$

where $y_n$ is the upper limit of the right tail of the probability distribution. The CRPS above is calculated for a single time point; to compute the CRPS over an interval of $h$ points, the average of the CRPS values is taken, such that [42,43,44]

$$\overline{CRPS} = \frac{1}{h} \sum_{j=1}^{h} CRPS_j. \tag{30}$$
The reliability of the model is further assessed using prediction intervals (PIs), which can be evaluated through the prediction interval width (PIW). The PIW is computed as the difference between the upper and lower bound estimates and is given by

$$PIW_t = UL_t - LL_t, \tag{31}$$

where $UL_t$ denotes the upper limit and $LL_t$ the lower limit of the PI at time $t$. Smaller values of $PIW_t$ are preferred, as they indicate a narrower and more robust prediction. PIs are computationally expensive and highly sensitive to deviations from normality.
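The three probabilistic metrics can be sketched directly from Equations (26)-(31); the version below approximates F by the empirical CDF of a hypothetical ensemble of predictive samples, one common nonparametric choice.

```r
# Sketch of MAD, a discretised CRPS (Equations (28)-(29)), and PIW.
# "pred_samples_list" is a hypothetical list of predictive-sample vectors,
# one per test time point; "y_test" holds the observed values.
mad_robust <- function(y) 1.4826 * median(abs(y - median(y)))  # Eq. (26)

crps_ecdf <- function(samples, x) {
  grid <- seq(min(samples) - 1, max(samples) + 1, length.out = 2000)
  dy   <- grid[2] - grid[1]
  Fy   <- ecdf(samples)(grid)        # empirical CDF as a stand-in for F
  H    <- as.numeric(grid >= x)      # Heaviside step at the observation
  sum((Fy - H)^2) * dy               # discretised integral, Eq. (29)
}

piw <- function(upper, lower) upper - lower                    # Eq. (31)

# Average CRPS over the test horizon, Eq. (30):
mean(mapply(crps_ecdf, pred_samples_list, y_test))
```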

3. Results

An HP notebook development environment (R version 4.2.2) running on a Core i5 processor was used to train and test all models in this study.

3.1. Evaluation of Point Predictions

Figure 6 presents the one-step-ahead point prediction performance metrics of the five models fitted to the CSIR, NUST, Venda, and RVD wind speed data. The intention is to verify the prediction strength of the proposed WT-NNAR-LSTM-GBM (M3) model against the other four (4) models, namely, the benchmark NNAR (M1) (see Box 1), LSTM (M2), WT-KNN-LSTM-GBM (M4), and WT-NNAR-KNN-GBM (M5). The proposed M3 performs better than all other models on all performance metrics for the CSIR, NUST, and Venda datasets. For the RVD dataset, M4 outperforms all models based on the lowest RMSE; on the other hand, M3 achieved a lower MAE than M4 for the same dataset. Model M5 produced the second-best performance after M3 for the NUST and Venda datasets on all performance indicators. Across the datasets, the models achieved their smallest RMSE values on the NUST data and their smallest MAE values on the RVD data.
Box 1. Benchmarking.
According to [3], “Persistence method, though simple to model, may not be an effective forecasting technique”. The authors of [3] further articulated that “A single model like persistence is not effective in capturing all the details in wind speed”. Similar results were observed in [5,14]. At the RVD station, the persistence model produced the highest RMSE (1.1148) and MAE (0.9587), while NNAR (M1) reached an RMSE of 0.1614 and an MAE of 0.1911. Similar results were observed at the other three stations. Accordingly, we foresaw that the persistence model would not provide effective and meaningful comparative insights. Hence, this study chose to benchmark against the NNAR model.
Among the individual models (M1 and M2), M2 dominated M1 in terms of RMSE and MAE for the high-variant RVD dataset. On the other hand, M1 outperformed M2 for NUST on all performance metrics. For the CSIR and Venda datasets, M1 was superior to M2 based on MAE, whereas for the same CSIR dataset, M2 outcompeted M1 with the lower RMSE value. M1 performed better on the low-variant datasets (NUST and Venda) than on the high-variant datasets (CSIR and RVD). The individual models (particularly M1) are the most efficient in execution time; however, their prediction accuracy is less desirable than that of the hybrid models.
Signal processing through wavelet decomposition enhances model accuracy across all datasets. For example, in Venda data, the RMSE values for M1 and M3 are 0.3097 m/s and 0.1279 m/s, respectively. Furthermore, SampEn improves predictive performance by reducing the complexity of hybrid models. Thus, judging the complexity or randomness characteristic of the subseries using SampEn can effectively enhance the prediction accuracy of the hybrid model. Hence, wavelet decomposition and SampEn are essential components of the proposed approach to ensure model stability in varying locations and seasons of the year. The ADAM optimiser algorithm contributes significantly to the improved accuracy of the proposed model’s predictions. Furthermore, the proposed hybrid model (M3) successfully captures high (extreme) and low wind speed values, which is different from other strategies (see Figure 7). As a result, the proposed hybrid model (M3) has a competitive advantage over other hybrid models.
Overall, all models have prediction strengths that vary with the station location and the technique applied. Furthermore, the performance metrics suggest that the M3 has the most robust prediction effect and is the best model for predicting all four wind speed datasets, as it effectively captures sudden changes in the wind speed data compared to other models (see Figure 7).

3.2. Residual Analysis

Table 12 summarises the residuals of the fitted models for the CSIR, NUST, RVD, and Venda datasets. The best models’ values are bolded. The residuals of all models for the NUST and Venda datasets are positively skewed and have longer tails to the right of the distribution; thus, there were more positive errors than negative errors. Furthermore, the residuals of all models tend to underestimate the wind speed data from the CSIR, NUST, and Venda stations. Models M3 and M4, respectively, overestimate the CSIR and RVD datasets (negative skewness values). M1 produced the most extreme positive skewness values for the CSIR (skewness = 3.6226), RVD (skewness = 4.3280), and Venda (skewness = 3.4968) datasets (see Figure 8), while M3 (followed by M5) produced the most extreme negative skewness values (longer tails stretched to the left) for the RVD (skewness = −6.2854) dataset.
Through the bias test, we determined whether each of the fitted models consistently under-predicts or over-predicts the actual values. Ideally, there should be a 50/50 split between under-predictions and over-predictions. Model M1 is the most biased model for the CSIR (55% $\hat{y}_t > y_t$) and RVD (56% $\hat{y}_t > y_t$) datasets, whilst M2 is the most biased model for both the NUST (65% $\hat{y}_t > y_t$) and Venda (58% $\hat{y}_t > y_t$) datasets. Model M3 is the least biased for the CSIR (51% $\hat{y}_t > y_t$), RVD (48% $\hat{y}_t > y_t$), Venda (50% $\hat{y}_t > y_t$), and NUST (50% $\hat{y}_t > y_t$) datasets (see Figure 8).
Model M3 produced the narrowest residuals (smallest standard deviation) for the CSIR, NUST, and Venda datasets. For the RVD dataset, M4 achieved the narrowest residuals, followed by M3, M5, M2, and M1. Except for NUST, where M2 yielded the widest residual spread, M1 produced the widest residual spread for the remaining three datasets, namely CSIR, RVD, and Venda (also see Figure 8).
By applying the Anderson–Darling (AD) normality test, we assessed whether the residuals are approximately normal and homoscedastic, i.e., whether a model's predictive power varies with the section of data examined. For NUST, only the residuals from model M5 passed the normality test (p-value > 0.05). The residuals from M1 have p-values < 0.05 for the CSIR and Venda datasets and thus failed the normality test. For the RVD data, no model's residuals passed the test at the 5% level, although M2 and M4 produced the largest p-values (both approximately 0.014; see Table 12 and Figure 8).
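As an illustration of the diagnostics reported in Table 12, the sketch below computes the under/over-prediction split, residual standard deviation, skewness, and the AD normality p-value for one model and station; `normal_ad` from statsmodels is used here as one available implementation of the AD test, and the function name is ours.

```python
import numpy as np
from scipy.stats import skew
from statsmodels.stats.diagnostic import normal_ad

def residual_diagnostics(y, y_hat):
    """Table 12-style residual diagnostics for one model/station pair."""
    e = y - y_hat                         # e_t = y_t - y_hat_t
    ad_stat, ad_pvalue = normal_ad(e)     # Anderson-Darling normality test
    return {
        "under_pred_%": 100 * np.mean(e > 0),  # y_hat_t < y_t
        "over_pred_%": 100 * np.mean(e < 0),   # y_hat_t > y_t
        "std_dev": e.std(ddof=1),
        "skewness": skew(e),
        "AD_p_value": ad_pvalue,               # reject normality if < 0.05
    }
```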
The overall residual analysis shows that M3 predicts the CSIR, NUST, and Venda data with the least bias and the highest accuracy of any model, whereas M4 is the least biased and most accurate predictor of the high-variance RVD data (see Figure 8).

3.3. Evaluation of Probabilistic Predictions

Table 13 presents the MAD, CRPS, and PIW at the 90% confidence level for all five models fitted to the four datasets under investigation, with the most effective models' metrics in bold. Here, MAD assesses the sharpness (the “narrowness” of the predictive distribution) of the fitted models. Model M3 produced the smallest MAD value, and hence the narrowest predictive distribution, for the CSIR, NUST, and RVD datasets, while M1 yielded the lowest MAD for the Venda dataset. Accordingly, M3 is the sharpest model for the CSIR, NUST, and RVD datasets, and M1 is the sharpest for the least-variant Venda dataset.
All models were found to be valid at the 90% prediction interval with nominal confidence (PINC). Assessing the models' reliability via the 90% PIW showed that model M3 generated the narrowest PIW (smallest standard deviation) for the NUST and RVD datasets, while models M5 and M4 had the narrowest PIW for the CSIR and Venda datasets, respectively. Model M1 produced the widest PIW (highest standard deviation) for the CSIR, RVD, and Venda datasets; similarly, M2 had the broadest PIW for the NUST dataset. On average, 260 of the 288 predicted values (more than 90%) fell within the 90% PI. For the CSIR data, model M1 placed the most values within the 90% PIs; for the RVD dataset, M2 and M3 did so. For the NUST data, M4 predicted the most values outside the PIs, and for the Venda dataset, M5 had the highest count outside the 90% PI.
The CRPS brings calibration (reliability) into the comparison of the fitted models. The mean CRPS values show that M4 is superior to all other models for the CSIR, RVD, and low-variant Venda datasets, while M5 produced the smallest CRPS for the NUST dataset. Thus, M5 and M4 demonstrate evidence of well-calibrated models for the NUST and Venda data, respectively. On the other hand, M1 lacks calibration, producing the poorest (highest) mean CRPS for the CSIR, RVD, and Venda datasets. Overall, M4 is the best-calibrated model based on CRPS.
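The probabilistic metrics in Table 13 can be estimated along the following lines. This is a sketch under the assumption of a sample-based (ensemble) predictive distribution, using the standard empirical CRPS estimator $\mathrm{CRPS} \approx \mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$; the paper's code is not published, so the function and the exact MAD convention here are assumptions.

```python
import numpy as np
from scipy.stats import median_abs_deviation

def probabilistic_metrics(samples, y, alpha=0.10):
    """samples: (n_draws, n_times) predictive draws; y: (n_times,) actuals."""
    lo = np.quantile(samples, alpha / 2, axis=0)        # 5% bound
    hi = np.quantile(samples, 1 - alpha / 2, axis=0)    # 95% bound
    piw = hi - lo                                       # 90% PI width per time step
    outside = int(np.sum((y < lo) | (y > hi)))          # OL: count outside the PI

    # Empirical CRPS: E|X - y| - 0.5 * E|X - X'|, averaged over time steps.
    term1 = np.mean(np.abs(samples - y), axis=0)
    term2 = 0.5 * np.mean(
        np.abs(samples[:, None, :] - samples[None, :, :]), axis=(0, 1)
    )
    crps = float(np.mean(term1 - term2))

    return {
        "MAD": median_abs_deviation(samples, axis=None),  # sharpness proxy
        "StdDev_PIW": piw.std(ddof=1),
        "OL_count": outside,
        "mean_CRPS": crps,
    }
```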
The overall comparative analysis shows that M3 is the best-performing model based on MAD, PIW, and OL, whereas M4 is the best based on CRPS. The probabilistic forecast analysis thus shows that the hybrid models incorporating a stateless LSTM (M3 and M4) do not suffer from miscalibration and capture the transient behaviour of wind speed reasonably well. As a result, M3 and M4 provide more reliable predictions and generalise better across the four datasets representing different seasons.

4. Conclusions

The hybrid approach proposed in this study, WT-NNAR-LSTM-GBM, aims to address the inherent turbulence of wind speed and thereby improve short-term wind speed predictions. High-resolution minutely averaged wind speed datasets downloaded from the CSIR, NUST, RVD, and Venda radiometric stations in Southern Africa were used to test the efficacy of the developed model against the WT-NNAR-KNN-GBM, WT-KNN-LSTM-GBM, and LSTM models and the benchmark NNAR model. In detail, the conclusions derived from this work are as follows:
(i) Wavelet decomposition of the nonlinear wind speed data into more statistically sound components improved the prediction performance of the hybrid models across all four datasets. Similar results were found in [7,12,13,14].
(ii) SampEn reduced the complexity of the prediction task and improved predictive performance by tailoring the modelling and forecasting method to the specific attributes of each subseries.
(iii) NNARs proved successful in predicting the less random (deterministic) subseries, being resistant to non-stationarity and outliers. As a standalone model, however, NNAR is highly computationally efficient but the least accurate on both point and probabilistic metrics.
(iv) Using a stateless LSTM significantly enhanced the predictive accuracy of the hybrid models by capturing the extreme wind speed values associated with turbulence, producing more accurate, reliable, and robust predictions (a brief sketch follows this list). Similar results were found in [36].
(v) Although computationally intensive [6,8], LSTMs are valuable for wind speed prediction due to their advanced pattern recognition and their handling of gradient disappearance and explosion, and the fusion of predictions using the nonlinear GBM is efficient and robust across different seasons and station locations. These results are analogous to those found in [5,22].
(vi) Our overall comparative analysis of point evaluation metrics (MAE and RMSE), probabilistic evaluation metrics (MAD and PIW), and residual diagnostics found the WT-NNAR-LSTM-GBM model (followed by the better-calibrated WT-KNN-LSTM-GBM based on CRPS) to be the most accurate, sharpest, most robust, and most reliable option for modelling and predicting all four datasets across locations with differing weather patterns.
(vii) Model performance is influenced by both station location and season. WT-NNAR-LSTM-GBM was considerably less sensitive to these factors than the individual NNAR and LSTM models.
(viii) The proposed hybrid approach shows promising results for short-term wind speed forecasting and can be utilised for purposes such as achieving uniform wind power distribution, optimising wind power output, and ensuring smooth grid operations in real time.
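To make conclusions (iv) and (v) concrete, the following minimal Keras sketch shows a stateless LSTM along the lines of Table 8 (tanh activations, MSE loss, Adam at a 1–2% learning rate, roughly 25–30 epochs); the layer widths, lag length, and variable names are illustrative assumptions rather than the authors' exact configuration.

```python
import numpy as np
from tensorflow import keras

def make_supervised(series, lags=12):
    """Frame a 1-D subseries as (samples, lags, 1) inputs with next-step targets."""
    X = np.array([series[i:i + lags] for i in range(len(series) - lags)])
    y = series[lags:]
    return X[..., np.newaxis], y

def build_stateless_lstm(lags=12, units=32):
    # stateful=False (the Keras default): the hidden state is reset between
    # batches, giving the stateless behaviour favoured in this study.
    model = keras.Sequential([
        keras.layers.Input(shape=(lags, 1)),
        keras.layers.LSTM(units, activation="tanh", return_sequences=True),
        keras.layers.LSTM(units, activation="tanh"),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01), loss="mse")
    return model

# Illustrative training run on a synthetic high-entropy subseries:
series = np.sin(np.arange(1440) / 8) + 0.3 * np.random.default_rng(1).normal(size=1440)
X, y = make_supervised(series[:1152])     # 1152 training points, as in Table 2
model = build_stateless_lstm()
model.fit(X, y, epochs=30, batch_size=32, verbose=0)
```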
In future work, additional decomposition levels could be added to the proposed model, and it could be tested on larger and more complex datasets beyond the Southern African region. Furthermore, since the current input dataset comprises only historical wind speed, other meteorological variables could be used to expand it. The stateless LSTM could also be tuned with different dropout rates to enhance the efficacy of the proposed model. The predictive model could be improved even further by replacing SampEn with the simpler yet robust permutation entropy (sketched below) and replacing the LSTMs with simpler and faster gated recurrent units (GRUs).
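As a pointer for the suggested extension, a minimal permutation entropy sketch is given below; the order and delay values are illustrative choices, and this function could stand in for SampEn in the subseries-routing step described earlier.

```python
import math
from collections import Counter
import numpy as np

def permutation_entropy(x, order=3, delay=1, normalise=True):
    """Shannon entropy of ordinal patterns (Bandt-Pompe permutation entropy)."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (order - 1) * delay
    # Map each window of `order` points to the permutation that sorts it.
    patterns = Counter(
        tuple(np.argsort(x[i:i + order * delay:delay])) for i in range(n)
    )
    probs = np.array(list(patterns.values()), dtype=float) / n
    h = -np.sum(probs * np.log(probs))
    # Normalised to [0, 1] by the maximum entropy log(order!).
    return h / math.log(math.factorial(order)) if normalise else h
```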

Author Contributions

Conceptualisation, K.S.S.; methodology, K.S.S.; software, K.S.S.; validation, K.S.S. and E.R.; formal analysis, K.S.S.; investigation, K.S.S.; resources, K.S.S. and E.R.; data curation, K.S.S.; writing—original draft, K.S.S.; writing—review and editing, K.S.S. and E.R.; visualisation, K.S.S.; supervision, E.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received funding from the University of South Africa.

Data Availability Statement

These wind speed data can be downloaded from the SAURAN website (http://www.sauran.ac.za) (accessed on 12 December 2022).

Conflicts of Interest

The corresponding author states that there are no conflicts of interest.

References

  1. Wang, X.; Guo, P.; Huang, X. A Review of Wind Power Forecasting Models. Energy Procedia 2011, 12, 770–777. [Google Scholar] [CrossRef]
  2. Hanifi, S.; Liu, X.; Lin, Z.; Lotfian, S. A Critical Review of Wind Power Forecasting Methods-Past, Present and Future. Energies 2020, 13, 3764. [Google Scholar] [CrossRef]
  3. Aasim; Singh, S.N.; Mohapatra, A. Repeated wavelet transform based ARIMA model for very short-term wind speed forecasting. Renew. Energy 2019, 136, 758–768. [Google Scholar] [CrossRef]
  4. Sivhugwana, K.S.; Ranganai, E. An Ensemble Approach to Short-Term Wind Speed Predictions Using Stochastic Methods, Wavelets and Gradient Boosting Decision Trees. Wind 2024, 4, 44–67. [Google Scholar] [CrossRef]
  5. Xiang, J.; Qiu, Z.; Hao, Q.; Cao, H. Multi-time scale wind speed prediction based on WT-bi-LSTM. MATEC Web Conf. 2020, 309, 05011. [Google Scholar] [CrossRef]
  6. Xie, A.; Yang, H.; Chen, V.; Sheng, L.; Zhang, Q. A Short-Term Wind Speed Forecasting Model Based on a Multi-Variable Long Short-Term Memory Network. Atmosphere 2021, 12, 651. [Google Scholar] [CrossRef]
  7. Berrezzek, F.; Khelil, K.; Bouadjila, T. Efficient wind speed forecasting using discrete wavelet transform and artificial neural networks. Rev. D’Intell. Artif. 2019, 33, 447–452. [Google Scholar] [CrossRef]
  8. Delgado, I.; Fahim, M. Wind Turbine Data Analysis and LSTM-Based Prediction in SCADA System. Energies 2021, 14, 125. [Google Scholar] [CrossRef]
  9. Ibrahim, M.; Alsheikh, M.; Al-Hindawi, O.; Al-Dahidi, S.; ElMoaqet, H. Short-Time Wind Speed Forecast Using Artificial Learning-Based Algorithms. Comput. Intell. Neurosci. 2020, 2020, 8439719. [Google Scholar] [CrossRef]
  10. Chen, N.; Qian, Z.; Meng, X. Multistep Wind Speed Forecasting Based on Wavelet and Gaussian Processes. Math. Probl. Eng. 2013, 2013, 461983. [Google Scholar] [CrossRef]
  11. Fuad, N.; Gamal, A.; Ammar, A.; Ali, A.; Sieh Kiong, T.; Nasser, A.; Janaka, E.; Ahmed, A. Multistep short-term wind speed prediction using nonlinear auto-regressive neural network with exogenous variable selection. Alex. Eng. J. 2021, 60, 1221–1229. [Google Scholar] [CrossRef]
  12. Hua, Y.; Zhao, Z.; Li, R.; Chen, X.; Liu, Z.; Zhang, H. Deep Learning with Long Short-Term Memory for Time Series Prediction. IEEE Commun. Mag. 2018, 57, 114–119. [Google Scholar] [CrossRef]
  13. Liu, Y.; Guan, L.; Hou, C.; Han, H.; Liu, Z.; Sun, Y.; Zheng, M. Wind Power Short-Term Prediction Based on LSTM and Discrete Wavelet Transform. Appl. Sci. 2019, 9, 1108. [Google Scholar] [CrossRef]
  14. Wang, A. Hybrid Wavelet Transform Based Short-Term Wind Speed Forecasting Approach. Sci. World J. 2014, 2014, 914127. [Google Scholar] [CrossRef] [PubMed]
  15. Mutavhatsindi, T.; Sigauke, C.; Mbuvha, R. Forecasting Hourly Global Horizontal Solar Irradiance in South Africa Using Machine Learning Models. IEEE Access 2020, 8, 198872–198885. [Google Scholar] [CrossRef]
  16. Feng, T.; Yang, S.; Han, F. Chaotic time series prediction using wavelet transform and multi-model hybrid method. J. Vibro Eng. 2019, 21, 983–1999. [Google Scholar] [CrossRef]
  17. Yu, Y.; Cao, J.; Zhu, J. An LSTM Short-Term Solar Irradiance Forecasting Under Complicated Weather Conditions. IEEE Access 2019, 7, 145651–145666. [Google Scholar] [CrossRef]
  18. Jin, Y.; Guo, H.; Wang, J.; Song, A. A Hybrid System Based on LSTM for Short-Term Power Load Forecasting. Energies 2020, 13, 6241. [Google Scholar] [CrossRef]
  19. Zhang, J.; Wei, Y.; Tan, Z.F.; Ke, W.; Tian, W. A Hybrid Method for Short-Term Wind Speed Forecasting. Sustainability 2017, 9, 596. [Google Scholar] [CrossRef]
  20. Adamowski, K.; Prokoph, A.; Adamowski, J. Development of a new method of wavelet aided trend detection and estimation. Hydrol. Process. 2009, 23, 2686–2696. [Google Scholar] [CrossRef]
  21. Saroha, S.; Aggarwal, S. Wind power forecasting using wavelet transforms and neural networks with tapped delay. J. Power Energy Syst. 2018, 4, 197–209. [Google Scholar] [CrossRef]
  22. Wei, Q.; Liu, D.-H.; Wang, K.-H.; Liu, Q.; Abbod, M.F.; Jiang, B.C.; Chen, K.-P.; Wu, C.; Shieh, J.-S. Multivariate multiscale entropy applied to center of pressure signals analysis: An effect of vibration stimulation of shoes. Entropy 2012, 14, 2157–2172. [Google Scholar] [CrossRef]
  23. Chen, J.; Zeng, G.Q.; Zhou, W.; Du, W.; Lu, K.D. Wind speed forecasting using nonlinear-learning ensemble of deep learning time series prediction and extremal optimization. Energy Convers. Manag. 2018, 165, 681–695. [Google Scholar] [CrossRef]
  24. Dghais, A.A.; Ismail, M.T. A Comparative Study between Discrete Wavelet Transform and Maximal Overlap Discrete Wavelet Transform for Testing Stationarity. Int. J. Math. Comput. Phys. Electr. Comput. Eng. 2013, 7, 1677–1681. [Google Scholar]
  25. Cornish, C.R.; Bretherton, C.S.; Percival, D.B. Maximal Overlap Wavelet Statistical Analysis With Application to Atmospheric Turbulence. Bound.-Layer Meteorol. 2006, 119, 339–374. [Google Scholar] [CrossRef]
  26. Rodrigues, D.V.Q.; Zuo, D.; Li, C. A MODWT-Based Algorithm for the Identification and Removal of Jumps/Short-Term Distortions in Displacement Measurements Used for Structural Health Monitoring. IoT 2022, 3, 60–72. [Google Scholar] [CrossRef]
  27. Paramasivam, S.; Pl, S.A.; Sathyamoorthi, P. Maximal overlap discrete wavelet transform-based power trace alignment algorithm against random delay countermeasure. ETRI J. 2022, 44, 512–523. [Google Scholar] [CrossRef]
  28. Delgado-Bonal, A.; Marshak, A. Approximate Entropy and Sample Entropy: A Comprehensive Tutorial. Entropy 2019, 21, 541. [Google Scholar] [CrossRef] [PubMed]
  29. Lake, D.E.; Richman, J.S.; Griffin, M.P.; Moorman, J.R. Sample entropy analysis of neonatal heart rate variability. Am. J. Physiol.-Regul. Integr. Comp. Physiol. 2002, 283, R789–R797. [Google Scholar] [CrossRef] [PubMed]
  30. Richman, J.S.; Moorman, J.R. Physiological time-series analysis using approximate entropy and sample entropy. Am. J. Physiol.-Heart Circ. Physiol. 2000, 278, H2039–H2049. [Google Scholar] [CrossRef] [PubMed]
  31. Ronak, B.; Na, H.; Yi, S.; Neil, D.; Tony, S.; David, M. Efficient Methods for Calculating Sample Entropy in Time Series Data Analysis. Procedia Comput. Sci. 2018, 145, 97–104. [Google Scholar] [CrossRef]
  32. Celeux, G.; Soromenho, G. An entropy criterion for assessing the number of clusters in a mixture model. J. Classif. 1996, 13, 195–212. [Google Scholar] [CrossRef]
  33. Hyndman, R.J.; Athanasopoulos, G. Forecasting: Principles and Practice, 2nd ed.; OTexts: Melbourne, Australia, 2021. [Google Scholar]
  34. Hochreiter, S.; Schmidhuber, J. Long Short-term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  35. Yadav, A.; Jha, C.K.; Sharan, A. Optimizing LSTM for time series prediction in Indian stock market. Procedia Comput. Sci. 2020, 167, 2091–2100. [Google Scholar] [CrossRef]
  36. Saha, S. Comprehensive Forecasting-Based Analysis of Hybrid and Stacked Stateful/Stateless Models. arXiv 2024. [Google Scholar] [CrossRef]
  37. Friedman, J.H. Stochastic gradient boosting. Comput. Stat. Data Anal. 2002, 38, 367–378. [Google Scholar] [CrossRef]
  38. Singh, U.; Rizwan, M.; Alaraj, M.; Alsaidan, I. A Machine Learning-Based Gradient Boosting Regression Approach for Wind Power Production Forecasting: A Step towards Smart Grid Environments. Energies 2021, 14, 5196. [Google Scholar] [CrossRef]
  39. Huang, C.M.; Chen, S.J.; Yang, S.P.; Chen, H.-J. One-Day-Ahead Hourly Wind Power Forecasting Using Optimized Ensemble Prediction Methods. Energies 2023, 16, 2688. [Google Scholar] [CrossRef]
  40. Martínez, F.; Frías, M.P.; Charte, F.; Rivera, A.J. Time Series Forecasting with KNN in R: The tsfknn Package. R J. 2019, 11, 229–242. [Google Scholar] [CrossRef]
  41. Gensler, A. Wind Power Ensemble Forecasting: Performance Measures and Ensemble Architectures for Deterministic and Probabilistic Forecasts. Ph.D. Thesis, University of Kassel, Kassel, Germany, 21 September 2018. [Google Scholar]
  42. Funk, S.; Camacho, A.; Kucharski, A.J.; Lowe, R.; Eggo, R.M.; Edmunds, W.J. Assessing the performance of real-time epidemic forecasts: A case study of Ebola in the Western Area region of Sierra Leone, 2014–2015. PLoS Comput. Biol. 2019, 15, e1006785. [Google Scholar] [CrossRef]
  43. Gneiting, T.; Katzfuss, M. Probabilistic Forecasting. Annu. Rev. Stat. Appl. 2014, 1, 125–151. [Google Scholar] [CrossRef]
  44. Gneiting, T.; Raftery, A.E. Strictly Proper Scoring Rules, Prediction, and Estimation. J. Am. Stat. Assoc. 2007, 102, 359–378. [Google Scholar] [CrossRef]
Figure 1. The time series and Q-Q plots of minutely averaged wind speed data for the CSIR (a), NUST (b), RVD (c), and Venda (d) stations. Blue lines represent Q-Q lines, while grey boxes indicate interquartile ranges.
Figure 2. Level three MODWT results for minutely averaged wind speed data for CSIR (top left panel), NUST (top right panel), Venda (bottom left panel), and RVD (bottom right panel). D1–D3 denote the detail coefficients at the different decomposition levels, and A3 denotes the approximation of $Y_t$.
Figure 3. A typical NNAR(p, k) architecture consists of an input layer, a hidden layer, and an output layer [33]. The values $\{y_{t-1}, y_{t-2}, \ldots, y_{t-s}, y_{t-2s}, \ldots, y_{t-p}\}$ represent the lagged inputs of order p, with s being the seasonal period. The number of neurons in the hidden layer is denoted by k, and the resultant output at time t is $y_t$.
Figure 4. Schematic representation of an LSTM cell.
Figure 5. Proposed WT-NNAR-LSTM-GBM model.
Figure 6. Model comparisons using performance metrics for CSIR (top left panel), NUST (top right panel), RVD (bottom left panel), and Venda (bottom right panel).
Figure 7. Comparison of 288 min predictions and actual wind speed data for CSIR (top panel), NUST (second top panel), RVD (second bottom panel), and Venda (bottom panel).
Figure 8. Distributions of the residuals for CSIR (top left panel), NUST (top right panel), RVD (bottom left panel), and Venda (bottom right panel).
Table 1. Location coordinates of the stations.

Station   Latitude     Longitude   Elevation (m)   Topography
CSIR      −25.746519   28.278739   1400            The roof of a building
NUST      −22.565000   17.075001   1683            The roof of the engineering building
RVD       −28.560841   16.761459   141             Inside an enclosure in a desert region
Venda     −23.131001   30.423910   628             Vuwani Science Research Centre
Table 2. Details of the minutely averaged wind speed datasets under experimentation.

Station   Season   Date                Sample Size   Training Set   Testing Set
CSIR      Winter   15 August 2019      1440          1152           288
NUST      Autumn   15 May 2019         1440          1152           288
RVD       Spring   7 September 2019    1440          1152           288
Venda     Summer   31 January 2019     1440          1152           288
Table 3. Summary of the descriptive statistics of the wind speed datasets (in m/s).

Station   Min     Q1       Median   Mean     St. Dev.   Q3       Max      Skewness   Kurtosis
CSIR      0.000   1.812    2.725    2.692    1.116      3.512    8.050    0.143      2.792
NUST      0.000   0.6465   1.1760   1.2772   0.861      1.7820   5.2890   0.769      3.641
RVD       0.036   3.832    7.588    7.125    3.603      10.030   14.400   −0.218     1.947
Venda     0.000   0.962    1.454    1.561    0.812      2.048    4.136    0.579      3.008
Table 4. SampEn tolerance parameter (m = 2, r = 0.2δ_{Y_t}) for the wind speed subseries datasets.

Station   D1 (r)   D2 (r)   D3 (r)   A3 (r)
CSIR      0.0691   0.0665   0.0657   0.1905
NUST      0.0510   0.0516   0.0482   0.1485
RVD       0.0486   0.0559   0.0605   0.7141
Venda     0.0472   0.0507   0.0478   0.1386
Table 5. Computed SampEn values for the wavelet subseries.

Station   D1       D2       D3       A3
CSIR      1.7127   1.4321   0.7617   0.3292
NUST      0.9138   0.9038   0.7038   0.3309
RVD       1.2615   1.1175   0.6430   0.0716
Venda     1.3662   1.1764   0.7288   0.3743
Table 6. Standard deviation (δ) and skewness (ϑ) indicators for the wind speed subseries datasets (in m/s).

Station   D1 (δ, ϑ)          D2 (δ, ϑ)          D3 (δ, ϑ)           A3 (δ, ϑ)
CSIR      (0.3455, 0.0549)   (0.3324, 0.0946)   (0.3287, −0.1607)   (0.9525, −0.1555)
NUST      (0.2550, 0.1959)   (0.2579, 0.2359)   (0.2412, −0.0412)   (0.7424, 0.3316)
RVD       (0.2430, 0.2374)   (0.2794, 0.0740)   (0.3026, −0.5158)   (3.5706, −0.2514)
Venda     (0.2361, 0.0793)   (0.2537, 0.1026)   (0.2388, −0.0025)   (0.6928, 0.4834)
Table 7. Topology of the fitted NNAR models (p-k-1).

Station   D1       D2       D3        A3       y_t
CSIR      17-9-1   10-6-1   22-12-1   16-8-1   16-8-1
NUST      8-4-1    10-6-1   23-12-1   15-8-1   9-5-1
RVD       6-4-1    10-6-1   20-10-1   4-2-1    17-9-1
Venda     7-5-1    8-6-1    14-8-1    12-6-1   20-10-1

y_t = actual wind speed value.
Table 9. Hyperparameter search space for the GBM for the four datasets.

Hyperparameter      Values
Distribution        Gaussian
Trees               ~463–812
Interaction depth   ~3–7
Learning rate       ~5–6%
Loss function       Root mean square error (RMSE)
Cross-validation    10%
Table 10. Contribution of each model to the proposed WT-NNAR-LSTM-GBM strategy.

WTs: WTs are of paramount importance as a denoising and transform technique, designed to minimise random fluctuations in the data sequence and enhance prediction accuracy. They are therefore adopted to deconstruct the wind speed data into one low-frequency signal and several high-frequency signals.

SampEn: Besides being highly efficient and simple, SampEn can determine the randomness of a wind speed series without prior knowledge of the source data. It is employed here to determine the level of complexity of each decomposed signal, thereby ensuring that the most appropriate modelling and forecasting approach is applied to each subseries for improved prediction accuracy.

NNAR: NNARs not only detect trends but are also robust to non-stationarity and outlier effects. These nonlinear approximators are consequently employed to detect and model nonlinear features within the subseries recognised as less random (deterministic) under the SampEn criterion.

LSTM: To circumvent the gradient disappearance and explosion to which NNARs are vulnerable, the subseries judged more complex (highly random) under the SampEn criterion (i.e., SampEn values close to or greater than 1, i.e., at least 0.9) are modelled using a more robust and reliable stateless LSTM. Stateless LSTMs are preferred over stateful LSTMs for this prediction task owing to their stability and accuracy.

GBM: In addition to its robustness and scalability, a nonlinear GBM is preferred over a linear combination (such as conventional direct summation) for prediction fusion because of its high accuracy. It also accounts for the nonlinear structure of the subseries forecasts when combining predictions, thus enhancing predictive performance.
Table 11. Formulation of the hybrid models.

Model              Decomposition   Entropy   D1     D2     D3     A3     Fusion
WT-NNAR-LSTM-GBM   WT              SampEn    LSTM   LSTM   NNAR   NNAR   GBM
WT-NNAR-KNN-GBM    WT              SampEn    KNN    KNN    NNAR   NNAR   GBM
WT-KNN-LSTM-GBM    WT              SampEn    LSTM   LSTM   KNN    KNN    GBM
Table 8. Hyperparameter search space for the LSTM network for the four datasets.

Hyperparameter                  Values
Activation function             Hyperbolic tangent
Number of layers                3
Loss function                   MSE
Optimiser                       Adam
Learning rate                   ~1–2%
Epochs (D1, D2, D3, A3, y_t)    ~25–30

y_t = actual wind speed value; subseries = {D1, D2, D3, A3}.
Table 12. Residual analysis of the fitted models for the four datasets.

CSIR              M1        M2        M3        M4        M5
e_t < 0 (%)       54.8610   48.6111   51.3889   47.5694   52.0833
e_t > 0 (%)       45.1389   51.3889   48.6111   52.4306   47.9167
Std.Dev (m/s)     0.5229    0.5093    0.0934    0.1064    0.0954
Skewness          3.6226    0.0048    −0.0472   0.0679    0.0242
AD* (α = 0.05)    <0.0001   0.06522   0.1147    0.6743    0.05911

NUST              M1        M2        M3        M4        M5
e_t < 0 (%)       46.1806   64.5833   50.0000   49.3056   49.3056
e_t > 0 (%)       53.8194   35.4167   50.0000   50.6944   50.6944
Std.Dev (m/s)     0.1558    0.2829    0.0781    0.1118    0.0918
Skewness          0.7140    0.7507    0.2284    0.3877    0.2161
AD* (α = 0.05)    <0.0001   <0.0001   0.0043    0.0059    0.0522

RVD               M1        M2        M3        M4        M5
e_t < 0 (%)       55.5556   52.4306   47.5694   47.5694   46.8750
e_t > 0 (%)       44.4444   47.5694   52.4306   52.4306   53.1250
Std.Dev (m/s)     0.6257    0.1913    0.1132    0.0774    0.1150
Skewness          4.3280    0.1282    −6.2854   −0.0108   −5.8903
AD* (α = 0.05)    <0.0001   0.01416   <0.0001   0.01404   <0.0001

VENDA             M1        M2        M3        M4        M5
e_t < 0 (%)       51.7361   57.6389   49.6528   48.2638   46.8750
e_t > 0 (%)       48.2639   42.3611   50.3472   52.4306   53.1250
Std.Dev (m/s)     0.2996    0.2668    0.1281    0.1483    0.1466
Skewness          3.4968    0.0625    0.0669    0.1281    0.2871
AD* (α = 0.05)    <0.0001   0.00654   0.5505    0.1815    0.1412

Std.Dev = standard deviation; AD* = Anderson–Darling test p-value; $e_t = y_t - \hat{y}_t$.
Table 13. Comparative analysis of the models using scoring rules and PIW.

CSIR               M1       M2       M3       M4       M5       Mean
MAD (m/s)          0.0886   0.4808   0.0778   0.1006   0.0834   0.1662
St.Dev PIW (m/s)   0.5903   0.1373   0.0012   0.0118   0.0005   0.1482
OL (count)         25       30       27       29       27       28
CRPS (m/s)         0.4017   0.3541   0.3439   0.3358   0.3423   0.3555

NUST               M1       M2       M3       M4       M5       Mean
MAD (m/s)          0.1106   0.1527   0.0595   0.0950   0.0829   0.1021
St.Dev PIW (m/s)   0.0950   0.4450   0.0074   0.0118   0.0247   0.1178
OL (count)         28       28       28       29       28       28
CRPS (m/s)         0.3551   0.3634   0.3540   0.3503   0.3412   0.3528

RVD                M1       M2       M3       M4       M5       Mean
MAD (m/s)          0.0582   0.1664   0.0521   0.0640   0.0575   0.0796
St.Dev PIW (m/s)   0.7608   0.0569   0.0075   0.0283   0.0075   0.1722
OL (count)         28       26       26       30       28       28
CRPS (m/s)         0.6869   0.6368   0.6247   0.6141   0.6238   0.6373

VENDA              M1       M2       M3       M4       M5       Mean
MAD (m/s)          0.0187   0.2390   0.1110   0.1316   0.1354   0.1045
St.Dev PIW (m/s)   0.3564   0.1485   0.0143   0.0066   0.0284   0.1108
OL (count)         28       27       27       27       32       28
CRPS (m/s)         0.3740   0.3138   0.2873   0.2746   0.2754   0.3050

OL = count of observations outside the 90% PI limits.