Article

A Short-Term Load Forecasting Model Based on Crisscross Grey Wolf Optimizer and Dual-Stage Attention Mechanism

1 School of Electrical Engineering, Guangxi University, Nanning 530004, China
2 School of Traffic & Transportation, Nanning University, Nanning 530200, China
* Author to whom correspondence should be addressed.
Energies 2023, 16(6), 2878; https://doi.org/10.3390/en16062878
Submission received: 9 February 2023 / Revised: 15 March 2023 / Accepted: 17 March 2023 / Published: 21 March 2023
(This article belongs to the Special Issue Advanced Machine Learning Applications in Modern Energy Systems)

Abstract

Accurate short-term load forecasting is of great significance to the safe and stable operation of power systems and the development of the power market. Most existing studies apply deep learning models that consider only one feature or only the temporal relationship in the load time series. Therefore, to obtain an accurate and reliable prediction, this paper proposes a hybrid prediction model combining a dual-stage attention mechanism (DA), a crisscross grey wolf optimizer (CS-GWO) and a bidirectional gated recurrent unit (BiGRU). DA is introduced on the input side of the model to improve its sensitivity to key features and to information at key time points simultaneously. CS-GWO combines horizontal and vertical crossover operators to enhance the global search ability and population diversity of GWO, and is used to optimize BiGRU and accelerate the convergence of the model. Finally, a collected load dataset, four evaluation metrics, and parametric and non-parametric statistical tests are used to evaluate the proposed CS-GWO-DA-BiGRU short-term load prediction model. The experimental results show that the RMSE, MAE and SMAPE are reduced by 3.86%, 1.37% and 0.30%, respectively, relative to the second-best performing CSO-DA-BiGRU model, demonstrating that the proposed model fits the load data better and achieves better prediction results.

1. Introduction

Electric load forecasting plays an important role in the modernization of power system management and has become the research focus of current power enterprises [1]. It can be divided into long-term, medium-term and short-term forecasting according to its different purposes [2]. Among them, short-term load forecasting can ensure the safe and stable operation of the power system and improve social benefits [3]. Furthermore, it can accelerate the development of the power market and improve economic benefits [4]. Therefore, it is of great significance to design an efficient and accurate short-term load forecasting method.
Early short-term load forecasting methods mainly used the exponential smoothing method [5] and the hidden Markov model [6], but their ability to extract the nonlinear characteristics of load is weak [7]. With the rapid increase in the installation of smart meters [8] and the development of artificial intelligence technology [9], short-term load forecasting based on big data analysis has become a research hotspot, with models such as the BP neural network [10], extreme learning machine [11] and support vector machine [12]. In addition, to avoid falling into local minima, some scholars use swarm intelligence optimization algorithms to optimize the artificial intelligence model. For example, Niu and Dai [13] proposed a short-term load forecasting model based on modified particle swarm optimization, in which the parameters of a least squares support vector machine are optimized; their experimental results show that the proposed algorithm improves the regression accuracy and generalization ability of the model. To address the tendency of long short-term memory neural networks to fall into local minima, a whale optimization algorithm (WOA) has been used to optimize the network [14]. Li et al. [15] use grey wolf optimization (GWO) to optimize the parameters of every single kernel in an extreme learning machine to improve its forecasting ability. However, GWO easily falls into a local optimum and may fail to find the global optimal solution [16].
Moreover, the above-mentioned shallow learning methods are applicable only in scenarios where the historical load alone is used for forecasting. If deep hidden features must be extracted from massive load data, a deep learning model is needed. For example, Khan et al. [17] use a convolutional neural network (CNN) to extract the coupling relationships among the input features. Muzaffar and Afshari [18] use a long short-term memory (LSTM) neural network to learn the temporal correlation contained in load time series data. In addition, the gated recurrent unit (GRU), with a simpler structure and efficient feature extraction, has also been used for short-term load prediction [19]. To extract both the implicit coupling relationships among features and the temporal dependency in load time series, combinations of CNN and LSTM [20] or improved variants (e.g., CNN-GRU [21], GRU-TCN [22] and RCNN-ML-LSTM [23]) are used in load prediction. However, these deep learning models suffer from vanishing and exploding gradients [24]. Therefore, avoiding these problems and accelerating the convergence of the model is of great significance for improving the accuracy of load forecasting.
The above short-term load forecasting models based on artificial intelligence techniques do not consider the importance of input features, so important features fade as the step size increases [25]. Feature selection is the technique commonly used to select the most appropriate input features in forecasting problems [26]. Kong et al. [27] utilize principal component analysis to determine the major factors affecting wind speed, reducing the dimension of the relevant features and improving the generalization of the model. Li et al. [28] develop a feature selection method to choose competitive input features. These feature selection methods reduce the number of input features only once before prediction through simple correlation analysis, which may cause potentially important input variables to be discarded [29]. To address this problem, the attention mechanism was proposed [30], with the advantage of allowing the model to handle long time series dependencies more easily. For example, Wang et al. [31] proposed a short-term load forecasting model based on a feature attention mechanism (FA), in which the effective characteristics of the input variables are highlighted, improving prediction accuracy. In Ref. [32], a temporal attention mechanism (TA) is applied in short-term load prediction to capture the high-impact time steps of the load sequence, further reducing prediction errors. However, few studies combine FA with TA into a multi-stage attention mechanism that captures both the feature and temporal relationships in load time series.
In view of the shortcomings of existing forecasting models, this paper proposes a short-term load forecasting model (DA-CS-GWO-BiGRU) based on a dual-stage attention mechanism (DA) and crisscross grey wolf optimizer algorithm (CS-GWO). The contributions of this paper are presented as follows:
  • Combining the advantages of the feature and temporal attention mechanisms, a dual-stage attention mechanism (DA) is introduced in this paper. DA is utilized at the input side of the forecasting model to comprehensively capture both the correlations among input variables and the temporal dependency in the load time series.
  • To address the deficiencies of GWO, a novel crisscross grey wolf optimizer algorithm is applied to a short-term load forecasting problem for the first time. By introducing horizontal and vertical crossover operators, the global search ability and population diversity of CS-GWO are improved.
  • The proposed DA-CS-GWO-BiGRU model is verified by using the real load data set collected in a certain area. The experimental results show that the proposed model has higher forecasting accuracy than other comparison models, and has good application prospects.
The remainder of this paper is organized as follows. Section 2 introduces the basic principles of the deep learning models involved in this paper. Section 3 presents the methodology of the proposed DA-CS-GWO-BiGRU short-term load forecasting model. Section 4 introduces the metrics for evaluating predictions. Section 5 details the experiments, and the results are analyzed and discussed. Section 6 points out the limitations of this paper and indicates subsequent work. Finally, Section 7 summarizes this paper.

2. Principle of Deep Learning Model

2.1. BiGRU Neural Network

A recurrent neural network (RNN) can only remember short-term dependencies of a time series and is often accompanied by the problem of gradient explosion or disappearance in the training process [33], leading to its limited use in practice. By modifying the calculation method of the hidden state of RNN, GRU and LSTM can effectively strengthen the long-term dependence of time series [19]. The network structure of GRU is shown in Figure 1.
It can be seen from Figure 1 that the GRU network calculates the degree to which the current input is combined with the previous state information through the reset gate $r_t$. The calculation process is shown in Equation (1):

$$r_t = \sigma\left(x_t W_r + h_{t-1} U_r + b_r\right) \tag{1}$$

where $x_t$ is the input data at the t-th time step, $h_{t-1}$ is the output of the previous time step, $W_r$ and $U_r$ are the weight matrices of the reset gate, $b_r$ is the bias of the reset gate and $\sigma$ is the sigmoid activation function.
In addition, the GRU network controls how much of the previous state information $h_{t-1}$ is retained in the current state through the update gate $z_t$, as shown in Equation (2):

$$z_t = \sigma\left(x_t W_z + h_{t-1} U_z + b_z\right) \tag{2}$$

where $W_z$ and $U_z$ are the weight matrices of the update gate and $b_z$ is the bias of the update gate.
Next, GRU obtains the candidate hidden state through the reset gate based on the updating mechanism of RNN [34], as shown in Equation (3):

$$\tilde{h}_t = \tanh\left(x_t W_h + \left(r_t \odot h_{t-1}\right) U_h + b_h\right) \tag{3}$$

where $W_h$ and $U_h$ are the weight matrices of the candidate output and $b_h$ is the bias of the candidate output.
Finally, GRU obtains the new hidden state $h_t$ by combining the previous hidden state $h_{t-1}$, the candidate hidden state $\tilde{h}_t$ and the update gate $z_t$, as shown in Equation (4):

$$h_t = z_t \odot h_{t-1} + \left(1 - z_t\right) \odot \tilde{h}_t \tag{4}$$
It can be seen from Equations (1)–(4) that GRU only considers factors influencing the current time step and neglects future influencing factors [35]. The recently proposed bidirectional gated recurrent unit (BiGRU) effectively makes up for this deficiency: through its unique forward and backward propagation structure, BiGRU can fully exploit the dependencies hidden in both the preceding and following parts of the time series [36]. The network structure of BiGRU is shown in Figure 2, and its calculation process is shown in Equations (5)–(7):
$$\overrightarrow{h}_t = G\left(x_t, \overrightarrow{h}_{t-1}\right) \tag{5}$$

$$\overleftarrow{h}_t = G\left(x_t, \overleftarrow{h}_{t-1}\right) \tag{6}$$

$$h_t = \overrightarrow{w}_t \overrightarrow{h}_t + \overleftarrow{w}_t \overleftarrow{h}_t + b_t \tag{7}$$
where $\overrightarrow{h}_t$ is the state information of forward propagation, $\overleftarrow{h}_t$ is the state information of backward propagation, $\overrightarrow{w}_t$ and $\overleftarrow{w}_t$ are the weight matrices of the hidden layer in forward and backward propagation, respectively, $b_t$ is the bias of the hidden layer and $G(\cdot)$ denotes the GRU calculation process of Equations (1)–(4).
As can be seen from Figure 2, BiGRU can exploit information both from the past and the future. Therefore, this paper uses BiGRU for timing analysis.
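For concreteness, the following is a minimal sketch of such a BiGRU forecasting network in Keras, the library used for the experiments in Section 5. The layer sizes follow the settings reported there (32 units per direction, 24 linear outputs), while the rest of the architecture is an illustrative assumption rather than the exact model of this paper.

```python
from tensorflow.keras import layers, models

def build_bigru(n_timesteps, n_features):
    """Bidirectional GRU regressor: 32 units per direction, 24 linear outputs."""
    model = models.Sequential([
        # The Bidirectional wrapper runs one GRU forward and one backward over
        # the sequence and combines their outputs (Equations (5)-(7)).
        layers.Bidirectional(layers.GRU(32), input_shape=(n_timesteps, n_features)),
        # One linear output per hour of the forecast day.
        layers.Dense(24, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

model = build_bigru(n_timesteps=24, n_features=4)
```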

2.2. Attention Mechanism

By simulating human visual behavior, the attention mechanism adaptively assigns different attention weights to the input features of the model to highlight the more critical influence factors [37], helping the model predict better.
The attention mechanism is mainly composed of three parts: attention weight calculation, weight normalization and intermediate semantic vector calculation. Firstly, the attention weight $e$ of the different features in the model input $x$, or at the t-th time step, is calculated using a multi-layer perceptron or neural network. Then, so that the attention weights sum to 1, $e$ is normalized to obtain $\alpha$. Finally, the intermediate semantic vector is obtained from $x$ and $\alpha$, as shown in Equation (8):
$$c = \alpha x \tag{8}$$
Therefore, this paper expects to use the attention mechanism to capture the coupling relationship between each feature and the impact of information both from the past and the future on the forecasted load value.
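As an illustration of the three steps above, a minimal NumPy sketch is given below; the single dense scoring layer with a sigmoid is an assumption for demonstration, not the exact network used later in the paper.

```python
import numpy as np

def attention(x, W_e, b_e):
    """x: input vector; W_e, b_e: trainable scoring parameters."""
    # Step 1: attention weight calculation with a single dense layer + sigmoid.
    e = 1.0 / (1.0 + np.exp(-(x @ W_e + b_e)))
    # Step 2: softmax normalization so the weights sum to 1.
    alpha = np.exp(e) / np.exp(e).sum()
    # Step 3: intermediate semantic vector, Equation (8).
    return alpha * x

rng = np.random.default_rng(0)
x = rng.random(8)                                  # toy input with 8 features
c = attention(x, rng.standard_normal((8, 8)), np.zeros(8))
```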

3. DA-CS-GWO-BiGRU Short-Term Load Forecasting Model

3.1. Mathematical Model

Set $L(i) = \left[L(24i-23), \ldots, L(24i)\right]$ as the electric load time series of the previous day, $t(i) = \left[t_{\max}(i), t_{\max}(i+1), t_{\min}(i), t_{\min}(i+1)\right]$ as the highest and lowest temperatures of the previous day and the current day, $r(i) = \left[r(i), r(i+1)\right]$ as the rainfall of the previous day and the current day and $d(i) = \left[d(i), d(i+1)\right]$ as the weather/day type of the previous day and the current day. The short-term load forecasting problem can then be regarded as using the load information of the previous day, $L(i)$, combined with the relevant characteristics $t(i)$, $r(i)$ and $d(i)$, to predict the electric load values of the current day. Let the function map of the model be $F_\theta$; the prediction process is shown in Equation (9):
$$\hat{Y}(i) = F_\theta\left(X(i)\right) \tag{9}$$

where $X(i) = \left[L(i), t(i), r(i), d(i)\right] = \left[x_1(i), x_2(i), \ldots, x_T(i)\right]^T$ and $\hat{Y}(i)$ represents the predicted load values of the current day.
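A sketch of assembling one input sample $X(i)$ from these quantities follows; the function and argument names are hypothetical, since the paper does not specify its data layout.

```python
import numpy as np

def build_sample(load_prev, t_max, t_min, rain, day_type):
    """load_prev: 24 hourly loads L(24i-23)..L(24i) of the previous day;
    the other arguments are 2-element arrays (previous day, current day)."""
    return np.concatenate([load_prev, t_max, t_min, rain, day_type])

X_i = build_sample(np.zeros(24),
                   np.array([30.1, 31.2]),   # t_max: previous day, current day
                   np.array([22.4, 23.0]),   # t_min
                   np.array([0.0, 1.5]),     # rainfall
                   np.array([0, 1]))         # day type -> 32-dimensional X(i)
```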

3.2. Dual-Stage Attention Mechanism

As shown in Section 3.1, the prediction model takes the historical load time series $L(i)$, temperature $t(i)$, rainfall $r(i)$ and weather/day type $d(i)$ as inputs. According to [38], different time steps, and different features within the same time step, have unequal effects on the output. In order to simultaneously enhance sensitivity to features and to the temporal dimension when making short-term load predictions, this paper proposes a novel attention mechanism, named the dual-stage attention mechanism (DA), which combines a feature attention mechanism (FA) and a temporal attention mechanism (TA). Combining the advantages of FA and TA, DA can fully capture the relationships among variables and the temporal dependence in load time series, providing data support for an efficient forecasting model.

3.2.1. Feature Attention Mechanism

To highlight the more influential input features, FA is introduced. Like the general attention mechanism, FA consists of three parts; in this paper, the attention weights of the input features are calculated by a neural network. The realization of FA is demonstrated in Figure 3 and described below.
(1) Attention weight calculation: Set $X = \left[x_1, x_2, \ldots, x_T\right] \in \mathbb{R}^{N \times T}$ as the input of the prediction model, where $x_j = \left\{x_j(i)\right\}_{i=1}^{N} \in \mathbb{R}^N$ ($j \in \{1, 2, \ldots, n_f\}$) and $N$ is the number of samples. The quantization of the weight corresponding to each feature is shown in Equation (10):

$$e = \sigma\left(X W_e + b_e\right) \tag{10}$$

where $e = \left[e_1, e_2, \ldots, e_T\right]$ is the unnormalized attention weight, $W_e \in \mathbb{R}^{T \times T}$ is the trainable coefficient matrix and $b_e \in \mathbb{R}^{T}$ is the bias.
(2) Weight normalization: In order to make the attention weights satisfy a probability distribution summing to 1, $e$ is normalized by the softmax function, as shown in Equation (11):

$$\alpha_j = \mathrm{softmax}\left(e_j\right) = \exp\left(e_j\right) \Big/ \sum_{i=1}^{T} \exp\left(e_i\right) \tag{11}$$
(3) Intermediate semantic vector calculation: The normalized weight $\alpha_j$ is multiplied by the corresponding feature vector $x_j$ to enhance or reduce the expression of $x_j$. Finally, the adaptively optimized feature vector $X_{\mathrm{ATT}}$ is obtained as shown in Equation (12):

$$X_{\mathrm{ATT}} = \left[\alpha_1 x_1, \alpha_2 x_2, \ldots, \alpha_T x_T\right] \tag{12}$$
It is worth noting that the attention weights are dynamically changed during the training process of the model, and the weights are determined only when the iterations converge.

3.2.2. Temporal Attention Mechanism

In order to capture the temporal correlation between each time step in $X_{\mathrm{ATT}}$ and the current prediction result, TA is introduced. The adaptive extraction of features at important moments is realized by integrating TA with the BiGRU network. The implementation is shown in Figure 4 and described below.
TA also consists of the same three parts as FA.
(1) Attention weight calculation: Taking the vector $X_{\mathrm{ATT}}$, which contains the feature association relationships, and the hidden state $h_{t-1}$ at the previous time step of BiGRU as the inputs of TA, the attention weights at the t-th time step in the iterative process are quantified, as shown in Equation (13):

$$f_t = \sigma\left(\left[X_{\mathrm{ATT}}; h_{t-1}\right] W_f + b_f\right) \tag{13}$$

where $W_f \in \mathbb{R}^{(2T) \times n_p}$ is the trainable coefficient matrix, $b_f \in \mathbb{R}^{n_p}$ is the bias and $n_p$ is the number of hidden elements in the last layer of BiGRU.
(2) Weight normalization: In addition, $f_t$ is normalized using the softmax function, as shown in Equation (14):

$$\beta_t^j = \exp\left(f_t^j\right) \Big/ \sum_{i=1}^{T} \exp\left(f_t^i\right) \tag{14}$$
(3) Intermediate semantic vector calculation: In order to obtain the implicit temporal correlation at the t-th time step, the products $\beta_t^j \alpha_j x_j$ are summed to obtain an intermediate semantic vector $d_t$, which contains both feature-related and temporal-related information, as shown in Equation (15):

$$d_t = \sum_{j=1}^{T} \beta_t^j \alpha_j x_j \tag{15}$$
Once the iteration of BiGRU terminates, the final hidden state $h_T$ and the intermediate semantic vector $d_T$ are obtained. The final prediction is produced by a single-layer feedforward network using $h_T$ and $d_T$, as shown in Equation (16):

$$Y = \left[h_T; d_T\right] W_y + b_y \tag{16}$$

where $W_y \in \mathbb{R}^{(2T) \times n_p}$ is the weight matrix of the feedforward network and $b_y \in \mathbb{R}^{n_p}$ is the bias.
Assuming that all parameters of the DA-BiGRU model are $\theta$, its loss function is shown in Equation (17). The final DA-BiGRU prediction model is obtained by minimizing this loss function during training:

$$J\left(Y, \hat{Y}, \theta\right) = \frac{1}{N} \sum_{i=1}^{N} \left(Y(i) - \hat{Y}(i)\right)^2 \tag{17}$$
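The temporal-attention step of Equations (13)–(15) can be sketched in NumPy as below. The shapes (in particular, taking the hidden state to have length T so that $\beta_t$ has one weight per time step) are illustrative assumptions.

```python
import numpy as np

def temporal_attention(x_att, h_prev, W_f, b_f):
    """x_att: (T,) feature-attended values alpha_j * x_j; h_prev: previous hidden state."""
    concat = np.concatenate([x_att, h_prev])          # [X_ATT; h_{t-1}]
    f = 1.0 / (1.0 + np.exp(-(concat @ W_f + b_f)))   # Equation (13)
    beta = np.exp(f) / np.exp(f).sum()                # Equation (14)
    return beta @ x_att                               # Equation (15): d_t

T = 8
rng = np.random.default_rng(1)
d_t = temporal_attention(rng.random(T), rng.random(T),
                         rng.standard_normal((2 * T, T)), np.zeros(T))
```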

3.3. CS-GWO Optimization Algorithm

It has been shown that a prediction model trained with the Adam optimizer can converge rapidly in the early stage of training, but the learning rate becomes too low in the later stage, which may hinder effective convergence and cause generalization problems [39]. Swarm intelligence optimization algorithms (e.g., PSO, GWO) are used to address these problems; however, they can in turn fail to find global optimal solutions [16]. In Ref. [40], an improved GWO algorithm created by incorporating the crisscross optimization algorithm (CSO) [41] is proposed for solving the optimal power flow problem, effectively avoiding local optima and preventing premature convergence. Inspired by this, this paper applies a crisscross grey wolf optimizer (CS-GWO) to optimize the DA-BiGRU model in the early stage of training, so as to further accelerate convergence and improve the generalization of the short-term load prediction model. Compared with GWO, CS-GWO achieves better global search ability and population diversity by introducing the horizontal and vertical crossover operators of CSO.
The implementation of CS-GWO is mainly composed of five parts, which are parameter initialization, hunting, attacking prey, horizontal crossover and vertical crossover. The implementation process is described in detail as follows.

3.3.1. Parameter Initialization

Set the grey wolf population as $\Phi = \left[\theta_1, \theta_2, \ldots, \theta_{n_M}\right]^T \in \mathbb{R}^{n_M \times D}$, where $n_M$ is the population size and $D$ is the population dimension. Select the individual with the best fitness value in $\Phi$ as grey wolf $\alpha$, the individual with the second-best fitness value as grey wolf $\beta$, the individual with the third-best fitness value as grey wolf $\delta$, and the rest of the population as grey wolves $\omega$.

3.3.2. Hunting

Since the optimal hunting position is unknown in the abstract search space, the three wolves with the strongest hunting ability are set to guide the hunt. These three wolves are $\alpha$, $\beta$ and $\delta$, and each other wolf $\omega$ updates its position according to the positions of these three wolves during the iteration, as shown in Equations (18) and (19):
$$\begin{aligned} \theta_\alpha' &= \theta_\alpha(t) - A_1 \left| C_1 \theta_\alpha(t) - \theta_\omega(t) \right| \\ \theta_\beta' &= \theta_\beta(t) - A_2 \left| C_2 \theta_\beta(t) - \theta_\omega(t) \right| \\ \theta_\delta' &= \theta_\delta(t) - A_3 \left| C_3 \theta_\delta(t) - \theta_\omega(t) \right| \end{aligned} \tag{18}$$

$$\theta_\omega(t+1) = \frac{\theta_\alpha' + \theta_\beta' + \theta_\delta'}{3} \tag{19}$$

where $\theta_\alpha(t)$, $\theta_\beta(t)$, $\theta_\delta(t)$ and $\theta_\omega(t)$ are the positions of grey wolves $\alpha$, $\beta$, $\delta$ and $\omega$ at the t-th iteration, respectively, and $A_1$, $A_2$, $A_3$, $C_1$, $C_2$ and $C_3$ are synergy coefficients, calculated as in Equations (20) and (21).

3.3.3. Attack Prey

Once the prey is at rest, grey wolves stop searching and attack. To simulate this process, GWO designs a synergy coefficient $A$, as shown in Equation (20):

$$A = 2 a r_1 - a \tag{20}$$

where $r_1$ is a random number in the range [0, 1], and $a$ decreases linearly from 2 to 0 over the iterations.
It can be seen from Equation (20) that GWO simulates the attack process of grey wolves. When $|A| < 1$, the wolves attack the prey; when $|A| > 1$, the wolves leave the prey, hoping to find better prey.
In addition, in order to avoid local optima, GWO designs another synergy coefficient $C$, as shown in Equation (21):

$$C = 2 r_2 \tag{21}$$

where $r_2$ is a random number in the range [0, 1].
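Putting Equations (18)–(21) together, one position update for a single $\omega$ wolf can be sketched in NumPy as follows; `gwo_step` is a hypothetical helper name used here for illustration.

```python
import numpy as np

def gwo_step(theta_w, leaders, a, rng):
    """Move one omega wolf toward alpha, beta and delta (Equations (18)-(19))."""
    proposals = []
    for theta_l in leaders:                        # alpha, beta, delta positions
        A = 2 * a * rng.random(theta_w.shape) - a  # Equation (20)
        C = 2 * rng.random(theta_w.shape)          # Equation (21)
        proposals.append(theta_l - A * np.abs(C * theta_l - theta_w))  # Eq. (18)
    return np.mean(proposals, axis=0)              # Equation (19)
```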

3.3.4. Horizontal Crossover

In order to improve the global search ability of GWO, horizontal crossover (HC) performs arithmetic crossover operations between two different individuals in all dimensions. Assuming that the i-th parent $\theta_i$ and the j-th parent $\theta_j$ ($i, j \in \{1, 2, \ldots, n_M\}$) perform the HC operation on the d-th dimension, their offspring can be expressed as:

$$\begin{aligned} \theta_{i,d}^{HC} &= r_3 \times \theta_{i,d} + \left(1 - r_3\right) \times \theta_{j,d} + C_4 \times \left(\theta_{i,d} - \theta_{j,d}\right) \\ \theta_{j,d}^{HC} &= r_4 \times \theta_{j,d} + \left(1 - r_4\right) \times \theta_{i,d} + C_5 \times \left(\theta_{j,d} - \theta_{i,d}\right) \end{aligned} \tag{22}$$

where $r_3$ and $r_4$ are uniformly distributed random values in [0, 1], and $C_4$ and $C_5$ are uniformly distributed random values in [−1, 1]. Once HC is complete, the new population $\Phi^{HC} = \left[\theta_1^{HC}, \theta_2^{HC}, \ldots, \theta_{n_M}^{HC}\right]^T \in \mathbb{R}^{n_M \times D}$ is obtained.

3.3.5. Vertical Crossover

In order to improve the population diversity of GWO, vertical crossover (VC) performs arithmetic crossover operations for each individual between two different dimensions. Assuming that the $d_1$-th and $d_2$-th dimensions of an individual perform the VC operation, the offspring can be expressed as:

$$\theta_{i,d_1}^{VC} = r \times \theta_{i,d_1} + \left(1 - r\right) \times \theta_{i,d_2} \tag{23}$$

where $r$ is a random value uniformly distributed in [0, 1]. Once VC is complete, the new population $\Phi^{VC} = \left[\theta_1^{VC}, \theta_2^{VC}, \ldots, \theta_{n_M}^{VC}\right]^T \in \mathbb{R}^{n_M \times D}$ is obtained.
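Both crossover operators translate directly into NumPy. The sketch below draws the random coefficients per dimension, as Equations (22) and (23) prescribe; the function names are hypothetical.

```python
import numpy as np

def horizontal_crossover(theta_i, theta_j, rng):
    """Equation (22): arithmetic crossover between two individuals, all dimensions."""
    shape = theta_i.shape
    r3, r4 = rng.random(shape), rng.random(shape)
    c4, c5 = rng.uniform(-1, 1, shape), rng.uniform(-1, 1, shape)
    child_i = r3 * theta_i + (1 - r3) * theta_j + c4 * (theta_i - theta_j)
    child_j = r4 * theta_j + (1 - r4) * theta_i + c5 * (theta_j - theta_i)
    return child_i, child_j

def vertical_crossover(theta, d1, d2, rng):
    """Equation (23): crossover between two dimensions of one individual."""
    r = rng.random()
    child = theta.copy()
    child[d1] = r * theta[d1] + (1 - r) * theta[d2]
    return child
```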

3.3.6. The Detailed Implementation Steps of CS-GWO

To mitigate the vanishing and exploding gradient problems in deep neural networks [42], CS-GWO is used to optimize the weights and biases $\theta$ of the DA-BiGRU model in the early stage of training, aiming to improve the generalization performance of the model. The flow chart of CS-GWO is shown in Figure 5, and the detailed implementation steps of the CS-GWO algorithm are as follows:
(1) Initialize parameters: Set the number of grey wolf populations  n M , the maximum number of iterations T and the population dimension D (the number of weights and bias of the DA-BiGRU model), and initialize the population  Φ .
(2) Set the fitness function: Take Equation (17) as the fitness function.
(3) Determine wolves $\alpha$, $\beta$ and $\delta$: Use Equation (17) to calculate the fitness of each individual; the individual with the best fitness value is grey wolf $\alpha$, the second-best is grey wolf $\beta$ and the third-best is grey wolf $\delta$.
(4) Update position and synergy coefficient: Firstly, update the position of grey wolf  ω  according to Equations (18) and (19), and then update A and C according to Equations (20) and (21).
(5) Horizontal crossover: According to Equation (22), perform horizontal crossover on the parent population $\Phi$ to obtain the offspring population $\Phi^{HC}$, and use Equation (17) to calculate the fitness of each individual. If the fitness of individual $\theta_k$ ($k \in \{1, 2, \ldots, n_M\}$) in $\Phi$ is worse than that of individual $\theta_k^{HC}$ in $\Phi^{HC}$, replace $\theta_k$ with $\theta_k^{HC}$; otherwise, do not replace.
(6) Vertical crossover: According to Equation (23), perform vertical crossover on the parent population $\Phi$ to obtain the offspring population $\Phi^{VC}$, and use Equation (17) to calculate the fitness of each individual. If the fitness of individual $\theta_k$ in $\Phi$ is worse than that of individual $\theta_k^{VC}$ in $\Phi^{VC}$, replace $\theta_k$ with $\theta_k^{VC}$; otherwise, do not replace.
(7) Iteration termination: If the number of iterations reaches T, the position of grey wolf $\alpha$ is used as the initial weights and biases $\theta_\alpha$ of the DA-BiGRU model; otherwise, return to step (3) and continue the iteration.
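A compact sketch of the whole loop, reusing the hypothetical `gwo_step`, `horizontal_crossover` and `vertical_crossover` helpers from the earlier sketches, is given below. For brevity it also applies the position update to the leader wolves, which is a simplification of step (4); `fitness` is assumed to evaluate Equation (17) for a candidate parameter vector.

```python
import numpy as np

def cs_gwo(fitness, n_M, D, T, rng):
    pop = rng.uniform(-1, 1, (n_M, D))                      # step (1)
    for t in range(T):
        fit = np.array([fitness(p) for p in pop])           # steps (2)-(3)
        leaders = pop[np.argsort(fit)[:3]]                  # alpha, beta, delta
        a = 2 * (1 - t / T)                                 # linear decay for Eq. (20)
        for k in range(n_M):                                # step (4)
            pop[k] = gwo_step(pop[k], leaders, a, rng)
        for k in range(0, n_M - 1, 2):                      # step (5): greedy HC
            ci, cj = horizontal_crossover(pop[k], pop[k + 1], rng)
            if fitness(ci) < fitness(pop[k]):
                pop[k] = ci
            if fitness(cj) < fitness(pop[k + 1]):
                pop[k + 1] = cj
        for k in range(n_M):                                # step (6): greedy VC
            d1, d2 = rng.choice(D, 2, replace=False)
            child = vertical_crossover(pop[k], d1, d2, rng)
            if fitness(child) < fitness(pop[k]):
                pop[k] = child
    return pop[np.argmin([fitness(p) for p in pop])]        # step (7): wolf alpha
```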

4. Evaluation Index

In order to evaluate the effectiveness of the proposed prediction model, this paper uses the root mean square error (RMSE), mean absolute error (MAE), symmetric mean absolute percentage error (SMAPE) and decision coefficient (R2) to evaluate the prediction results. The definitions of the four evaluation indicators are given in Equations (24)–(27). RMSE, MAE and SMAPE describe the error between the predicted and real values: the smaller the value, the more accurate the prediction. R2 assesses how well the predictions fit the actual values: the larger the value, the higher the prediction accuracy.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n_{test}} \sum_{n_{test}} \left(Y_{test} - \hat{Y}_{test}\right)^2} \tag{24}$$

$$\mathrm{MAE} = \frac{1}{n_{test}} \sum_{n_{test}} \left| Y_{test} - \hat{Y}_{test} \right| \tag{25}$$

$$\mathrm{SMAPE} = \frac{1}{n_{test}} \sum_{n_{test}} \frac{\left| Y_{test} - \hat{Y}_{test} \right|}{\left( \left| Y_{test} \right| + \left| \hat{Y}_{test} \right| \right) / 2} \tag{26}$$

$$R^2 = 1 - \frac{\sum_{n_{test}} \left(Y_{test} - \hat{Y}_{test}\right)^2}{\sum_{n_{test}} \left(Y_{test} - \bar{Y}_{test}\right)^2} \tag{27}$$
where $Y_{test}$ and $\hat{Y}_{test}$ are the actual and predicted load values in the testing dataset, respectively, $\bar{Y}_{test}$ is the average of the actual load values and $n_{test}$ is the number of samples in the testing dataset.
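The four indicators translate directly into NumPy. In the sketch below SMAPE is reported as a percentage, which is an assumption consistent with the magnitudes in Table 1.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Equations (24)-(27) over the flattened test set."""
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    smape = np.mean(np.abs(err) / ((np.abs(y_true) + np.abs(y_pred)) / 2)) * 100
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return rmse, mae, smape, r2
```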

5. Experiment and Analysis

This section will verify the effectiveness of the proposed DA-CS-GWO-BiGRU short-term load forecasting model through four evaluation indicators (RMSE, MAE, SMAPE and R2) and two experiments. In addition, in order to reduce the errors caused by the experimental operation, both experiments were performed 20 times, and the average value was taken as the final experimental result. Both experiments are based on Python 3.8 and the Keras deep learning library. The core configuration of the used computer is Intel (R) Core (TM) i5-9600K 6-core processor, 3.70 GHz operating frequency, 8 GB memory capacity and Windows 10 operating system.
In particular, the load data used in this paper are real sample data from a region in 2018. The dataset contains 365 daily samples, each with a time resolution of 24 points per day. In order to reduce the influence of the data distribution on the experimental results, the data are randomly shuffled; 300 samples are selected as the training dataset, 30 samples as the validation dataset and 35 samples as the testing dataset. These datasets are depicted in Figure 6.

5.1. Parameter Settings

The attention mechanism is realized by a single-layer fully connected neural network with 32 neurons and a softmax activation function.
The number of neurons in the hidden layer of the prediction model based on a BP neural network is 32, the activation function is ReLU, the number of neurons in the output layer is 24 and the activation function is linear. The unit number of the models based on GRU and BiGRU is 32, the number of neurons in the output layer is 24 and the activation function is linear. The BP, GRU, BiGRU, FA-BiGRU, TA-BiGRU, and DA-BiGRU models all use the Adam optimizer, and their hyperparameters β1, β2 and ε are set to 0.9, 0.999 and 1 × 10−8, respectively. In addition, MSE is used as the loss function, and the number of iterations of these models is set to 500.
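The optimizer and loss configuration stated above can be written in Keras as follows; `model` and the dataset splits are assumed to be defined elsewhere (e.g., as in the sketch of Section 2.1), and the learning rate is not stated in the paper, so the Keras default is assumed.

```python
from tensorflow.keras.optimizers import Adam

# Hyperparameters from Section 5.1; learning rate assumed to be the
# Keras default of 0.001 since the paper does not state it.
optimizer = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
model.compile(optimizer=optimizer, loss="mse")

# 500 iterations as stated above; X_train/y_train/X_val/y_val come from the
# 300/30/35 split described in Section 5.
model.fit(X_train, y_train, epochs=500, validation_data=(X_val, y_val))
```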

5.2. Case 1: The Effectiveness of the BiGRU Model and Dual-Stage Mechanism

In order to better evaluate the effectiveness of the DA-BiGRU prediction model proposed in this paper, this section verifies the superiority of the BiGRU model and effectiveness of the dual-stage attention mechanism from the aspect of short-term load prediction. Persistence, BP, GRU, BiGRU, feature-attention-mechanism-based BiGRU (FA-BiGRU) and temporal-attention-mechanism-based BiGRU (TA-BiGRU) models are compared with the DA-BiGRU model in this case. The experimental results are shown in Table 1 and Figure 7, where Figure 7 is a comparison chart between the prediction results of different models and the real values on 29–31 December 2018.
As shown in Table 1 and Figure 7, the following conclusions can be drawn:
(1) The advantages of BiGRU model:
The prediction performance of the deep learning model is the best among the single prediction models (i.e., persistence, BP, GRU and BiGRU models), and the prediction accuracy of the BiGRU model is the highest. For example, compared with the classic baseline model persistence, the RMSE, MAE and SMAPE values of the BiGRU model are reduced by 15.46%, 14.38% and 0.942%, respectively, and the R2 value is increased by 1.67%. Compared with the shallow neural network BP model, the RMSE, MAE and SMAPE values of the BiGRU model are reduced by 10.56%, 9.40% and 0.531%, respectively, and the R2 value is increased by 1.98%. In addition, compared with the GRU model, the BiGRU model has the best RMSE, MAE, SMAPE and R2 values.
The reasons are as follows: Firstly, the machine learning model uses a large amount of historical load data for training, which can effectively capture the nonlinear relationship of load time series, and the prediction performance is improved compared with the persistence model. Secondly, the BiGRU model has a unique bidirectional propagation structure, which can link the past and future influencing factors with the current load time series so as to improve the accuracy of short-term load forecasting.
(2) The effectiveness of the dual-stage attention mechanism:
Combining the model with the feature attention mechanism automatically extracts the correlations among the features, which reduces the prediction error of the model. Compared with the BiGRU model, the RMSE, MAE and SMAPE values of the FA-BiGRU model decreased by 4.60%, 7.92% and 7.16%, respectively, and the R2 value increased by 0.65%.
Combining the model with the temporal attention mechanism realizes the adaptive extraction of features at important moments, which improves the prediction stability of the model. Compared with the BiGRU model, the RMSE, MAE and SMAPE values of the TA-BiGRU model decreased by 3.77%, 3.24% and 0.98%, respectively, and the R2 value increased by 0.11%.
In addition, compared with the other models in this case study, the proposed DA-BiGRU model has the best RMSE, MAE, SMAPE and R2 values. This is because the model combines the feature and temporal attention mechanisms, which improves its sensitivity to key features and key time steps, ultimately improving prediction accuracy.

5.3. Case 2: The Effectiveness of the CS-GWO Algorithm

The CEC 2017 test suite [43], a popular suite of benchmark functions for validating optimization performance, is utilized in this subsection to conduct extensive optimization experiments. The suite has 30 functions, which can be divided into four categories: unimodal functions (F1–F3), multimodal functions (F4–F10), hybrid functions (F11–F20) and composition functions (F21–F30). The ideal optimal value of each benchmark function is 0.
Moreover, the well-known optimization algorithms (i.e., PSO, WOA, GWO and CSO) are compared with CS-GWO to evaluate the effectiveness of the CS-GWO algorithm from various perspectives, including accuracy, the Wilcoxon signed-rank test and a paired samples t-test.

5.3.1. The Setting of the Numerical Experiments

The dimension of the benchmark functions is uniformly set to 30 in this subsection. For a fair comparison, the number of iterations of each swarm intelligence optimization algorithm and the number of individuals are set to 3000 and 30, respectively. The positions of PSO are limited to the range [−1, 1], and its speed is limited to [−0.5, 0.5] [44]. For WOA [45], GWO [46] and CS-GWO, the individuals are limited to the range [−1, 1]. The vertical crossover probability of CSO and CS-GWO is set to 60%, and the horizontal crossover probability is set to 100% [47]. To reduce statistical errors, all the results reported in this subsection are based on 30 independent runs.

5.3.2. The Comparison of Optimization Accuracy

The above-mentioned algorithms are evaluated using the CEC 2017 test suite, and the experimental results are shown in Table 2. The reported values in Table 2 are based on the errors between the terminated values of the optimization process and the target values of the benchmark functions. To intuitively quantify the optimization ability of the metaheuristics, mean values (Mean), minimum values (Min), maximum values (Max), standard deviation (Std) and ranks (Rank) are used to evaluate the accuracy. Mean, Min and Max reveal the optimization accuracy of the algorithm, Std reveals the optimization stability and Rank is based on the Friedman test [48] to rank the optimization performance of the algorithm from the aspect of statistics. Moreover, the minimum values of Mean in each benchmark function are shown in bold.
For the unimodal and multimodal functions (i.e., F1–F10), CS-GWO achieves the best results six times, while WOA and CSO share the remaining four best results. This reveals that WOA performs well on simple low-dimensional optimization problems. For the 10 hybrid functions (i.e., F11–F20), CS-GWO achieves the best performance seven times. Although WOA obtains the best values for F12 and F14, it obtains the second-worst values in 4 out of 10 cases, revealing that WOA performs unstably on complex problems [49,50]. For the 10 composition functions (i.e., F21–F30), CS-GWO achieves the best performance six times, followed by CSO with three. The worst rank of CSO among the composition functions is three, with F21 and F27, indicating that CSO has the ability to escape local optima in complex optimization problems [47].
Notably, GWO never ranks first on any benchmark function, yet it ranks third overall. This reveals that GWO performs stably on optimization problems but easily falls into local optima [51]. Overall, CS-GWO ranks first among the compared optimization algorithms and obtains the best performance in 19 out of 30 functions, in terms of both optimization accuracy and stability.
From the above analysis, we can conclude that CS-GWO performs best among all the compared algorithms on the 30-dimensional optimization problems. This is because CS-GWO combines the stable optimization performance of GWO with CSO's outstanding ability to find global optima in the solution space.

5.3.3. Wilcoxon Signed-Rank Test and Paired Samples t-Test

In order to further prove the validity of the CS-GWO algorithm, a parametric and a non-parametric test, namely the paired samples t-test (PSTT) [52] and the Wilcoxon signed-rank test (WSRT) [53], are adopted to evaluate the difference in optimization performance between CS-GWO and the comparison algorithms on the 30 benchmark functions. The null hypothesis of both PSTT and WSRT is that there is no difference between the two compared samples. If the differences approximately obey a normal distribution, PSTT is used; when this premise is not satisfied, WSRT is selected. The results of PSTT and WSRT are shown in Table 3 and Table 4, respectively.
Two indicators including t-value and Sig. (2-tailed) can be obtained from PSTT. If the Sig. (2-tailed) is less than 0.05, it can be concluded that the CS-GWO algorithm is better than the compared algorithm. In addition, four indicators including p-value, R+, R− and winner can be obtained from WSRT. If the p-value is less than 0.05, the null hypothesis can be rejected at 5% significance level. R+ represents a mean error of the CS-GWO algorithm that is higher than that of the compared one. R− represents a mean error of the CS-GWO algorithm that is lower than that of the compared one. Finally, winner indicates whether the CS-GWO algorithm is superior to the compared algorithm, “+” indicates that the CS-GWO algorithm is better than the compared algorithm, “−” indicates that the CS-GWO algorithm is worse than the compared algorithm and “=” indicates that the performance of the two algorithms display no obvious difference.
It can be seen that most of the Sig. (2-tailed) values in Table 3 are less than 0.05, which means that there is an obvious difference between the CS-GWO algorithm and the comparison algorithms, further indicating that the CS-GWO algorithm is superior to the algorithms involved. In Table 4, most of the combinations (p-value, R+, R−, winner) are (1.734 × 10−6, 0, 465, +), revealing that the CS-GWO algorithm outperforms the comparison algorithms. Furthermore, the result of '+/=/−' is 94/14/12, indicating that the CS-GWO algorithm is better than the compared algorithms in 94 out of 120 cases.
From a statistical perspective, it can be concluded that CS-GWO dominates the other compared algorithms in optimization problems with CEC 2017 test functions.

5.4. Case 3: The Effectiveness of the CS-GWO-DA-BiGRU Model

In order to verify the effectiveness of combining the crisscross grey wolf optimization algorithm with the DA-BiGRU model in short-term load forecasting, the PSO-DA-BiGRU, WOA-DA-BiGRU, GWO-DA-BiGRU and CSO-DA-BiGRU models are compared with the combined CS-GWO-DA-BiGRU model in this case. In this subsection, the number of iterations of the swarm intelligence optimization algorithms is uniformly set to 200, and the number of individuals is set to 20.
The experimental results are shown in Table 5 and Figure 8, where Figure 8 is a comparison chart between the prediction results of different models and the real values on 29–31 December 2018.
As shown in Table 5 and Figure 8, the following conclusions can be drawn:
(1) The effectiveness of the swarm intelligence optimization algorithm:
The prediction models combined with a swarm intelligence optimization algorithm have better prediction performance than the single prediction model (i.e., DA-BiGRU). For example, compared with the DA-BiGRU model, the PSO-DA-BiGRU and GWO-DA-BiGRU models reduce RMSE by 1.75% and 3.99%, MAE by 2.67% and 3.21% and SMAPE by 1.32% and 0.59%, respectively, and increase the R2 value by 0.21% and 0.53%, respectively. This is because the weights and biases of the DA-BiGRU model are optimized by the swarm intelligence optimization algorithm in the initial stage of training, which effectively avoids the vanishing and exploding gradient problems and further improves the accuracy of load forecasting.
(2) The superiority of the CS-GWO algorithm:
Among all the comparison forecasting models, the proposed CS-GWO-DA-BiGRU short-term load forecasting model has the highest forecasting accuracy. For example, the RMSE, MAE and SMAPE are reduced by 3.86%, 1.37% and 0.30% from those of the second-best performing CSO-DA-BiGRU model, respectively. From the aspect of the R2 value, the CS-GWO-DA-BiGRU model has an increase of 0.42% compared with the CSO-DA-BiGRU model. Therefore, the CS-GWO algorithm combined with horizontal crossover and vertical crossover operators can improve the global search ability and enhance the diversity of the population, making a great contribution to improving short-term load forecasting.

6. Discussion

In this paper, a high-precision model called CS-GWO-DA-BiGRU is presented in short-term load forecasting problems. However, the proposed model still has some shortcomings that need to be improved. The limitations and future research can be summarized as follows.
(1) The CS-GWO algorithm only focuses on improving the accuracy of short-term load prediction while ignoring prediction stability, which can lead to unstable predictions on new data. In the future, we plan to upgrade CS-GWO to a multi-objective CS-GWO algorithm to improve the accuracy and stability of short-term load prediction simultaneously.
(2) At present, the intelligent big data platform is valuable for the improvement of the prediction model. In future work, the proposed CS-GWO-DA-BiGRU prediction model will be embedded into the intelligent big data platform to construct an intelligent load forecasting system.

7. Conclusions

Short-term load prediction is essential for the stable operation and safety management of power systems. Therefore, this paper proposes a hybrid model for short-term load prediction, named CS-GWO-DA-BiGRU, which consists of a dual-stage attention mechanism, crisscross grey wolf optimization algorithm and bidirectional gated recurrent unit. The main contributions of this paper can be concluded as follows:
(1) Different from the conventional feature attention mechanism applied in short-term load forecasting, this paper proposes a dual-stage attention mechanism by combining feature and temporal attention mechanisms. In case 1, compared with FA-BiGRU, the RMSE, MAE and SMAPE values of the DA-BiGRU model are reduced by 1.79%, 0.74% and 0.70%, respectively, and the R2 value is increased by 0.21%. Therefore, DA can effectively capture the correlations among input features and the temporal dependence in load time series simultaneously.
(2) By combining horizontal and vertical crossover operators, the global search ability and population diversity of GWO are enhanced. Based on the Friedman test in case 2, CS-GWO ranks first among the well-known algorithms and achieves the best results for 19 out of 30 functions in CEC 2017. In addition, CS-GWO outperforms the compared algorithms in 94 out of 120 cases based on the Wilcoxon signed-rank test. Furthermore, for the proposed CS-GWO-DA-BiGRU model in case 3, which is based on CS-GWO, the R2 value has an increase of 0.42% compared with the CSO-DA-BiGRU model and has the best forecasting performance.

Author Contributions

Conceptualization, R.G.; software, X.L.; validation, X.L.; formal analysis, R.G.; investigation, R.G. and X.L.; resources, R.G.; data curation, X.L.; writing—original draft preparation, X.L.; writing—review and editing, R.G.; visualization, X.L.; supervision, R.G.; project administration, R.G.; funding acquisition, R.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China (Grant No. 61561007) and Guangxi Natural Science Foundation (Grant No. 2017GXNSFAA198168).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Nomenclatures

Abbreviations
WOA: whale optimization algorithm
GWO: grey wolf optimization
LSTM: long short-term memory
GRU: gated recurrent unit
FA: feature attention mechanism
TA: temporal attention mechanism
DA: dual-stage attention mechanism
CS-GWO: crisscross grey wolf optimizer algorithm
RNN: recurrent neural network
BiGRU: bidirectional gated recurrent unit
CSO: crisscross optimization algorithm
HC: horizontal crossover
VC: vertical crossover
RMSE: root mean square error
MAE: mean absolute error
SMAPE: symmetric mean absolute percentage error
R2: decision coefficient
Mean: mean value
Min: minimum value
Max: maximum value
Std: standard deviation
Rank: ranks
PSTT: paired samples t-test
WSRT: Wilcoxon signed-rank test
Formula symbols
$x_t$: input data at the t-th time step
$h_{t-1}$, $h_t$, $h_T$: hidden state at the (t−1)-th, t-th and T-th time step
$r_t$: reset gate
$W_r$, $U_r$, $b_r$: weight matrices and bias of the reset gate
$\sigma$: sigmoid activation function
$z_t$: update gate
$W_z$, $U_z$, $b_z$: weight matrices and bias of the update gate
$W_h$, $b_h$: weight matrices and bias of the candidate output
$\tilde{h}_t$: candidate hidden state
$\overrightarrow{h}_t$, $\overleftarrow{h}_t$: state information of forward and backward propagation
$\overrightarrow{w}_t$, $\overleftarrow{w}_t$: weight matrices of the hidden layer in forward and backward propagation
$b_t$: bias of the hidden layer
$G(\cdot)$: calculation process of GRU
$e$: unnormalized attention weight
$\alpha$: normalized attention weight
$c$: intermediate semantic vector
$L(i)$: electric load time series of the previous day
$t(i)$: highest and lowest temperature of the previous day and the current day
$r(i)$: rainfall of the previous day and the current day
$d(i)$: weather/day type of the previous day and the current day
$F_\theta$: function map of the prediction model
$X(i)$: input of the prediction model
$\hat{Y}(i)$: predicted load values of the current day
$N$: number of samples
$W_e$, $b_e$: weight matrix and bias in FA
$X_{\mathrm{ATT}}$: adaptively optimized feature vector
$W_f$, $b_f$: weight matrix and bias in TA
$n_p$: number of hidden elements in the last layer of BiGRU
$d_t$, $d_T$: intermediate semantic vector at the t-th and T-th iteration
$W_y$, $b_y$: weight matrix and bias of the feedforward network in TA
$\theta$: all parameters of the DA-BiGRU model
$J(\cdot)$: loss function of the DA-BiGRU model
$\Phi$: population of grey wolves
$n_M$: population size
$D$: population dimension
$\theta_\alpha(t)$, $\theta_\beta(t)$, $\theta_\delta(t)$, $\theta_\omega(t)$: positions of grey wolves $\alpha$, $\beta$, $\delta$ and $\omega$ at the t-th iteration
$A_1$, $A_2$, $A_3$, $C_1$, $C_2$, $C_3$, $A$, $C$: synergy coefficients
$r_1$, $r_2$, $r_3$, $r_4$, $r$: random numbers
$\Phi^{HC}$, $\Phi^{VC}$: offspring populations
$T$: maximum number of iterations
$Y_{test}$, $\hat{Y}_{test}$: actual and predicted load values in the testing dataset
$\bar{Y}_{test}$: average value of the actual load values
$n_{test}$: sample number of the testing dataset

References

  1. Vanting, N.B.; Ma, Z.; Jørgensen, B.N. A Scoping Review of Deep Neural Networks for Electric Load Forecasting. Energy Inform. 2021, 4, 49.
  2. Liu, Y.; Dutta, S.; Kong, A.W.K.; Yeo, C.K. An Image Inpainting Approach to Short-Term Load Forecasting. IEEE Trans. Power Syst. 2022, 38, 177–187.
  3. Li, L.; Guo, L.; Wang, J.; Peng, H. Short-Term Load Forecasting Based on Spiking Neural P Systems. Appl. Sci. 2023, 13, 792.
  4. Li, S.; Kong, X.; Yue, L.; Liu, C.; Khan, M.A.; Yang, Z.; Zhang, H. Short-Term Electrical Load Forecasting Using Hybrid Model of Manta Ray Foraging Optimization and Support Vector Regression. J. Clean. Prod. 2023, 388, 135856.
  5. Mayrink, V.; Hippert, H.S. A Hybrid Method Using Exponential Smoothing and Gradient Boosting for Electrical Short-Term Load Forecasting. In Proceedings of the 2016 IEEE Latin American Conference on Computational Intelligence (LA-CCI), Cartagena, Colombia, 2–4 November 2016; pp. 1–6.
  6. Wang, Y.; Kong, Y.; Tang, X.; Chen, X.; Xu, Y.; Chen, J.; Sun, S.; Guo, Y.; Chen, Y. Short-Term Industrial Load Forecasting Based on Ensemble Hidden Markov Model. IEEE Access 2020, 8, 160858–160870.
  7. Guo, L.; Wu, P.; Lou, S.; Gao, J.; Liu, Y. A Multi-Feature Extraction Technique Based on Principal Component Analysis for Nonlinear Dynamic Process Monitoring. J. Process Control 2020, 85, 159–172.
  8. Mehdipour Pirbazari, A.; Farmanbar, M.; Chakravorty, A.; Rong, C. Short-Term Load Forecasting Using Smart Meter Data: A Generalization Analysis. Processes 2020, 8, 484.
  9. Li, B.; Hou, B.; Yu, W.; Lu, X.; Yang, C. Applications of Artificial Intelligence in Intelligent Manufacturing: A Review. Front. Inf. Technol. Electron. Eng. 2017, 18, 86–96.
  10. Yu, F.; Xu, X. A Short-Term Load Forecasting Model of Natural Gas Based on Optimized Genetic Algorithm and Improved BP Neural Network. Appl. Energy 2014, 134, 102–113.
  11. Li, S.; Goel, L.; Wang, P. An Ensemble Approach for Short-Term Load Forecasting by Extreme Learning Machine. Appl. Energy 2016, 170, 22–29.
  12. Lu, H.; Azimi, M.; Iseley, T. Short-Term Load Forecasting of Urban Gas Using a Hybrid Model Based on Improved Fruit Fly Optimization Algorithm and Support Vector Machine. Energy Rep. 2019, 5, 666–677.
  13. Niu, D.; Dai, S. A Short-Term Load Forecasting Model with a Modified Particle Swarm Optimization Algorithm and Least Squares Support Vector Machine Based on the Denoising Method of Empirical Mode Decomposition and Grey Relational Analysis. Energies 2017, 10, 408.
  14. Shao, L.; Guo, Q.; Li, C.; Li, J.; Yan, H. Short-Term Load Forecasting Based on EEMD-WOA-LSTM Combination Model. Appl. Bionics Biomech. 2022, 2022, 2166082.
  15. Li, T.; Qian, Z.; He, T. Short-Term Load Forecasting with Improved CEEMDAN and GWO-Based Multiple Kernel ELM. Complexity 2020, 2020, 1209547.
  16. Image Segmentation of Leaf Spot Diseases on Maize Using Multi-Stage Cauchy-Enabled Grey Wolf Algorithm. Eng. Appl. Artif. Intell. 2022, 109, 104653.
  17. Khan, B.; Khalid, R.; Javed, M.U.; Javaid, S.; Ahmed, S.; Javaid, N. Short-Term Load and Price Forecasting Based on Improved Convolutional Neural Network. In Proceedings of the 2020 3rd International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), Sukkur, Pakistan, 29–30 January 2020; pp. 1–6.
  18. Muzaffar, S.; Afshari, A. Short-Term Load Forecasts Using LSTM Networks. Energy Procedia 2019, 158, 2922–2927.
  19. Jung, S.; Moon, J.; Park, S.; Hwang, E. An Attention-Based Multilayer GRU Model for Multistep-Ahead Short-Term Load Forecasting. Sensors 2021, 21, 1639.
  20. Alhussein, M.; Aurangzeb, K.; Haider, S.I. Hybrid CNN-LSTM Model for Short-Term Individual Household Load Forecasting. IEEE Access 2020, 8, 180544–180557.
  21. Sajjad, M.; Khan, Z.A.; Ullah, A.; Hussain, T.; Ullah, W.; Lee, M.Y.; Baik, S.W. A Novel CNN-GRU-Based Hybrid Approach for Short-Term Residential Load Forecasting. IEEE Access 2020, 8, 143759–143768.
  22. Cai, C.; Li, Y.; Su, Z.; Zhu, T.; He, Y. Short-Term Electrical Load Forecasting Based on VMD and GRU-TCN Hybrid Network. Appl. Sci. 2022, 12, 6647.
  23. Alsharekh, M.F.; Habib, S.; Dewi, D.A.; Albattah, W.; Islam, M.; Albahli, S. Improving the Efficiency of Multistep Short-Term Electricity Load Forecasting via R-CNN with ML-LSTM. Sensors 2022, 22, 6913.
  24. Chen, Q.; Zhang, W.; Zhu, K.; Zhou, D.; Dai, H.; Wu, Q. A Novel Trilinear Deep Residual Network with Self-Adaptive Dropout Method for Short-Term Load Forecasting. Expert Syst. Appl. 2021, 182, 115272.
  25. Kim, S.H.; Lee, G.; Kwon, G.-Y.; Kim, D.-I.; Shin, Y.-J. Deep Learning Based on Multi-Decomposition for Short-Term Load Forecasting. Energies 2018, 11, 3433.
  26. Bouktif, S.; Fiaz, A.; Ouni, A.; Serhani, M.A. Optimal Deep Learning LSTM Model for Electric Load Forecasting Using Feature Selection and Genetic Algorithm: Comparison with Machine Learning Approaches. Energies 2018, 11, 1636.
  27. Kong, X.; Liu, X.; Shi, R.; Lee, K.Y. Wind Speed Prediction Using Reduced Support Vector Machines with Feature Selection. Neurocomputing 2015, 169, 449–456.
  28. Li, S.; Wang, P.; Goel, L. Wind Power Forecasting Using Neural Network Ensembles with Feature Selection. IEEE Trans. Sustain. Energy 2015, 6, 1447–1456.
  29. Meng, A.; Chen, S.; Ou, Z.; Ding, W.; Zhou, H.; Fan, J.; Yin, H. A Hybrid Deep Learning Architecture for Wind Power Prediction Based on Bi-Attention Mechanism and Crisscross Optimization. Energy 2022, 238, 121795.
  30. Zhang, B.; Wu, J.-L.; Chang, P.-C. A Multiple Time Series-Based Recurrent Neural Network for Short-Term Load Forecasting. Soft Comput. 2018, 22, 4099–4112.
  31. Wang, S.; Wang, X.; Wang, S.; Wang, D. Bi-Directional Long Short-Term Memory Method Based on Attention Mechanism and Rolling Update for Short-Term Load Forecasting. Int. J. Electr. Power Energy Syst. 2019, 109, 470–479.
  32. Fazlipour, Z.; Mashhour, E.; Joorabian, M. A Deep Model for Short-Term Load Forecasting Applying a Stacked Autoencoder Based on LSTM Supported by a Multi-Stage Attention Mechanism. Appl. Energy 2022, 327, 120063.
  33. Ribeiro, A.H.; Tiels, K.; Aguirre, L.A.; Schön, T. Beyond Exploding and Vanishing Gradients: Analysing RNN Training Using Attractors and Smoothness. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, Online, 26–28 August 2020; pp. 2370–2380.
  34. Dewangan, F.; Abdelaziz, A.Y.; Biswal, M. Load Forecasting Models in Smart Grid Using Smart Meter Information: A Review. Energies 2023, 16, 1404.
  35. Zhang, D.; Kabuka, M.R. Combining Weather Condition Data to Predict Traffic Flow: A GRU-Based Deep Learning Approach. IET Intell. Transp. Syst. 2018, 12, 578–585.
  36. Xuan, Y.; Si, W.; Zhu, J.; Sun, Z.; Zhao, J.; Xu, M.; Xu, S. Multi-Model Fusion Short-Term Load Forecasting Based on Random Forest Feature Selection and Hybrid Neural Network. IEEE Access 2021, 9, 69002–69009.
  37. Li, C.; Liu, D.; Wang, M.; Wang, H.; Xu, S. Detection of Outliers in Time Series Power Data Based on Prediction Errors. Energies 2023, 16, 582.
  38. Niu, Z.; Yu, Z.; Tang, W.; Wu, Q.; Reformat, M. Wind Power Forecasting Using Attention-Based Gated Recurrent Unit Network. Energy 2020, 196, 117081.
  39. Keskar, N.S.; Socher, R. Improving Generalization Performance by Switching from Adam to SGD. arXiv 2017, arXiv:1712.07628.
  40. Meng, A.; Zeng, C.; Wang, P.; Chen, D.; Zhou, T.; Zheng, X.; Yin, H. A High-Performance Crisscross Search Based Grey Wolf Optimizer for Solving Optimal Power Flow Problem. Energy 2021, 225, 120211.
  41. Meng, A.; Chen, Y.; Yin, H.; Chen, S. Crisscross Optimization Algorithm and Its Application. Knowl. Based Syst. 2014, 67, 218–229.
  42. Guo, Y.; Lu, W.; Li, X.; Huang, Q. Single Image Reflection Removal Based on Residual Attention Mechanism. Appl. Sci. 2023, 13, 1618.
  43. Mohamed, A.W.; Hadi, A.A.; Fattouh, A.M.; Jambi, K.M. LSHADE with Semi-Parameter Adaptation Hybrid with CMA-ES for Solving CEC 2017 Benchmark Problems. In Proceedings of the 2017 IEEE Congress on Evolutionary Computation (CEC), San Sebastián, Spain, 5–8 June 2017; pp. 145–152.
  44. Zhaoyu, P.; Shengzhu, L.; Hong, Z.; Nan, Z. The Application of the PSO Based BP Network in Short-Term Load Forecasting. Phys. Procedia 2012, 24, 626–632.
  45. Lu, Y.; Wang, G. A Load Forecasting Model Based on Support Vector Regression with Whale Optimization Algorithm. Multimed. Tools Appl. 2023, 82, 9939–9959.
  46. Barman, M.; Dev Choudhury, N.B. A Similarity Based Hybrid GWO-SVM Method of Power System Load Forecasting for Regional Special Event Days in Anomalous Load Situations in Assam, India. Sustain. Cities Soc. 2020, 61, 102311.
  47. Meng, A.; Li, Z.; Yin, H.; Chen, S.; Guo, Z. Accelerating Particle Swarm Optimization Using Crisscross Search. Inf. Sci. 2016, 329, 52–72.
  48. Derrac, J.; García, S.; Molina, D.; Herrera, F. A Practical Tutorial on the Use of Nonparametric Statistical Tests as a Methodology for Comparing Evolutionary and Swarm Intelligence Algorithms. Swarm Evol. Comput. 2011, 1, 3–18.
  49. Hashim, F.A.; Houssein, E.H.; Hussain, K.; Mabrouk, M.S.; Al-Atabany, W. Honey Badger Algorithm: New Metaheuristic Algorithm for Solving Optimization Problems. Math. Comput. Simul. 2022, 192, 84–110.
  50. Abdel-Basset, M.; Mohamed, R.; Azeem, S.A.A.; Jameel, M.; Abouhawwash, M. Kepler Optimization Algorithm: A New Metaheuristic Algorithm Inspired by Kepler's Laws of Planetary Motion. Knowl. Based Syst. 2023, 110454, in press.
  51. Nadimi-Shahraki, M.H.; Zamani, H.; Fatahi, A.; Mirjalili, S. MFO-SFR: An Enhanced Moth-Flame Optimization Algorithm Using an Effective Stagnation Finding and Replacing Strategy. Mathematics 2023, 11, 862.
  52. Zimmerman, D.W. Teacher's Corner: A Note on Interpretation of the Paired-Samples t Test. J. Educ. Behav. Stat. 1997, 22, 349–360.
  53. Rey, D.; Neuhäuser, M. Wilcoxon-Signed-Rank Test. In International Encyclopedia of Statistical Science; Springer: Berlin/Heidelberg, Germany, 2011; pp. 1658–1659.
Figure 1. The network structure of GRU.
Figure 2. The network structure of BiGRU.
Figure 3. Feature attention mechanism.
Figure 4. Temporal attention mechanism.
Figure 5. The flow chart of CS-GWO.
Figure 6. Training, validation and testing datasets used in this paper: (a) training dataset, (b) validation dataset, (c) testing dataset.
Figure 7. Case study 1: comparison of forecasting results on 29–31 December 2018.
Figure 8. Case study 2: comparison of forecasting results on 29–31 December 2018.
Table 1. The experiment results of case study 1.

| Prediction Model | RMSE/MW | MAE/MW | SMAPE/% | R² |
|---|---|---|---|---|
| persistence | 36.679 | 29.258 | 4.810 | 0.900 |
| BP | 34.671 | 27.650 | 4.399 | 0.911 |
| GRU | 32.141 | 25.128 | 3.892 | 0.924 |
| BiGRU | 31.009 | 25.051 | 3.868 | 0.929 |
| FA-BiGRU | 29.583 | 23.068 | 3.591 | 0.935 |
| TA-BiGRU | 29.840 | 24.239 | 3.830 | 0.930 |
| DA-BiGRU | 29.053 | 22.897 | 3.566 | 0.937 |
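As a reading aid for Tables 1 and 5, the sketch below shows one way to reproduce the four evaluation metrics from vectors of actual and predicted loads. It is a minimal sketch, not the paper's implementation: in particular, the symmetric formulation of SMAPE (in percent, with the denominator averaging |actual| and |predicted|) is an assumption here, since the paper defines its metrics in an earlier section.

```python
import numpy as np

def evaluation_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """RMSE, MAE, SMAPE (%) and R^2 for a load forecast."""
    err = y_pred - y_true
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    # Assumed SMAPE convention: |error| over the mean of |actual| and |predicted|.
    smape = float(100.0 * np.mean(np.abs(err) / ((np.abs(y_true) + np.abs(y_pred)) / 2.0)))
    # Coefficient of determination: R^2 = 1 - SS_res / SS_tot.
    r2 = float(1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2))
    return {"RMSE/MW": rmse, "MAE/MW": mae, "SMAPE/%": smape, "R2": r2}

# Toy usage with made-up load values (MW):
y = np.array([610.0, 640.0, 655.0, 620.0])
y_hat = np.array([600.0, 648.0, 650.0, 628.0])
print(evaluation_metrics(y, y_hat))
```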
Table 2. Comparison of CS-GWO with well-known algorithms for CEC 2017 test functions.

| Function | Metric | PSO | WOA | GWO | CSO | CS-GWO |
|---|---|---|---|---|---|---|
| F1 | Mean | 3.169 × 10^11 | 2.408 × 10^3 | 3.781 × 10^10 | 2.275 × 10^10 | 4.875 × 10^3 |
| | Min | 1.137 × 10^11 | 1.070 × 10^2 | 5.121 × 10^9 | 4.985 × 10^9 | 1.000 × 10^2 |
| | Max | 5.806 × 10^11 | 9.847 × 10^3 | 8.496 × 10^10 | 6.467 × 10^10 | 1.771 × 10^4 |
| | Std | 9.145 × 10^10 | 2.219 × 10^3 | 2.197 × 10^10 | 1.430 × 10^10 | 5.283 × 10^3 |
| | Rank | 5 | 1 | 4 | 3 | 2 |
| F2 | Mean | 8.880 × 10^2 | 7.076 × 10^2 | 6.260 × 10^2 | 6.012 × 10^2 | 5.550 × 10^2 |
| | Min | 8.253 × 10^2 | 6.184 × 10^2 | 5.892 × 10^2 | 5.705 × 10^2 | 5.129 × 10^2 |
| | Max | 9.519 × 10^2 | 8.124 × 10^2 | 7.649 × 10^2 | 6.601 × 10^2 | 6.350 × 10^2 |
| | Std | 3.825 × 10^1 | 4.662 × 10^1 | 3.205 × 10^1 | 2.046 × 10^1 | 3.437 × 10^1 |
| | Rank | 5 | 4 | 3 | 2 | 1 |
| F3 | Mean | 1.406 × 10^5 | 5.016 × 10^4 | 4.718 × 10^4 | 3.481 × 10^4 | 3.001 × 10^2 |
| | Min | 8.673 × 10^4 | 2.929 × 10^4 | 3.286 × 10^4 | 1.132 × 10^4 | 3.000 × 10^2 |
| | Max | 3.292 × 10^5 | 7.088 × 10^4 | 6.894 × 10^4 | 6.353 × 10^4 | 3.003 × 10^2 |
| | Std | 6.473 × 10^4 | 1.017 × 10^4 | 9.881 × 10^3 | 1.028 × 10^4 | 6.049 × 10^−2 |
| | Rank | 5 | 4 | 3 | 2 | 1 |
| F4 | Mean | 6.990 × 10^3 | 4.687 × 10^2 | 6.504 × 10^2 | 5.645 × 10^2 | 4.937 × 10^2 |
| | Min | 2.951 × 10^3 | 4.001 × 10^2 | 5.166 × 10^2 | 4.833 × 10^2 | 4.641 × 10^2 |
| | Max | 1.422 × 10^4 | 4.911 × 10^2 | 1.218 × 10^3 | 7.337 × 10^2 | 5.187 × 10^2 |
| | Std | 2.613 × 10^3 | 1.995 × 10^1 | 1.457 × 10^2 | 6.179 × 10^1 | 1.520 × 10^1 |
| | Rank | 5 | 1 | 4 | 3 | 2 |
| F5 | Mean | 8.966 × 10^2 | 7.087 × 10^2 | 6.214 × 10^2 | 5.966 × 10^2 | 5.774 × 10^2 |
| | Min | 8.469 × 10^2 | 6.323 × 10^2 | 5.671 × 10^2 | 5.681 × 10^2 | 5.202 × 10^2 |
| | Max | 9.739 × 10^2 | 7.816 × 10^2 | 6.944 × 10^2 | 6.708 × 10^2 | 6.720 × 10^2 |
| | Std | 3.825 × 10^1 | 4.073 × 10^1 | 2.883 × 10^1 | 2.248 × 10^1 | 4.910 × 10^1 |
| | Rank | 5 | 4 | 3 | 2 | 1 |
| F6 | Mean | 7.009 × 10^2 | 6.730 × 10^2 | 6.270 × 10^2 | 6.196 × 10^2 | 6.003 × 10^2 |
| | Min | 6.795 × 10^2 | 6.482 × 10^2 | 6.126 × 10^2 | 6.102 × 10^2 | 6.000 × 10^2 |
| | Max | 7.331 × 10^2 | 7.250 × 10^2 | 6.460 × 10^2 | 6.366 × 10^2 | 6.017 × 10^2 |
| | Std | 1.221 × 10^1 | 1.364 × 10^1 | 8.497 × 10^0 | 6.550 × 10^0 | 4.418 × 10^−1 |
| | Rank | 5 | 4 | 3 | 2 | 1 |
| F7 | Mean | 1.397 × 10^3 | 1.140 × 10^3 | 8.841 × 10^2 | 8.559 × 10^2 | 8.710 × 10^2 |
| | Min | 1.218 × 10^3 | 1.022 × 10^3 | 8.019 × 10^2 | 7.960 × 10^2 | 7.698 × 10^2 |
| | Max | 1.525 × 10^3 | 1.324 × 10^3 | 1.076 × 10^3 | 9.970 × 10^2 | 8.977 × 10^2 |
| | Std | 6.925 × 10^1 | 7.795 × 10^1 | 6.307 × 10^1 | 5.190 × 10^1 | 3.069 × 10^1 |
| | Rank | 5 | 4 | 3 | 1 | 2 |
| F8 | Mean | 1.103 × 10^3 | 9.403 × 10^2 | 9.039 × 10^2 | 8.979 × 10^2 | 8.742 × 10^2 |
| | Min | 1.002 × 10^3 | 9.114 × 10^2 | 8.541 × 10^2 | 8.479 × 10^2 | 8.129 × 10^2 |
| | Max | 1.164 × 10^3 | 9.910 × 10^2 | 9.464 × 10^2 | 1.032 × 10^3 | 9.867 × 10^2 |
| | Std | 3.661 × 10^1 | 2.327 × 10^1 | 2.433 × 10^1 | 3.537 × 10^1 | 4.819 × 10^1 |
| | Rank | 5 | 4 | 3 | 2 | 1 |
| F9 | Mean | 9.729 × 10^3 | 7.439 × 10^3 | 2.287 × 10^3 | 2.092 × 10^3 | 9.002 × 10^2 |
| | Min | 5.733 × 10^3 | 3.310 × 10^3 | 1.410 × 10^3 | 1.035 × 10^3 | 9.000 × 10^2 |
| | Max | 1.324 × 10^4 | 1.376 × 10^4 | 4.608 × 10^3 | 3.691 × 10^3 | 9.029 × 10^2 |
| | Std | 1.659 × 10^3 | 3.347 × 10^3 | 7.803 × 10^2 | 7.556 × 10^2 | 5.347 × 10^−1 |
| | Rank | 5 | 4 | 3 | 2 | 1 |
| F10 | Mean | 8.429 × 10^3 | 6.333 × 10^3 | 5.139 × 10^3 | 4.483 × 10^3 | 7.671 × 10^3 |
| | Min | 6.359 × 10^3 | 4.002 × 10^3 | 2.849 × 10^3 | 3.091 × 10^3 | 6.807 × 10^3 |
| | Max | 1.005 × 10^4 | 9.761 × 10^3 | 8.679 × 10^3 | 7.903 × 10^3 | 8.508 × 10^3 |
| | Std | 8.850 × 10^2 | 1.699 × 10^3 | 1.674 × 10^3 | 1.270 × 10^3 | 4.446 × 10^2 |
| | Rank | 5 | 3 | 2 | 1 | 4 |
| F11 | Mean | 8.717 × 10^3 | 1.220 × 10^3 | 2.183 × 10^3 | 1.786 × 10^3 | 1.159 × 10^3 |
| | Min | 4.894 × 10^3 | 1.153 × 10^3 | 1.389 × 10^3 | 1.276 × 10^3 | 1.108 × 10^3 |
| | Max | 1.418 × 10^4 | 1.298 × 10^3 | 4.783 × 10^3 | 3.937 × 10^3 | 1.217 × 10^3 |
| | Std | 2.371 × 10^3 | 3.880 × 10^1 | 9.575 × 10^2 | 7.822 × 10^2 | 3.555 × 10^1 |
| | Rank | 5 | 2 | 4 | 3 | 1 |
| F12 | Mean | 2.599 × 10^10 | 1.764 × 10^5 | 3.831 × 10^8 | 5.773 × 10^8 | 3.160 × 10^5 |
| | Min | 2.572 × 10^9 | 1.191 × 10^4 | 2.075 × 10^7 | 2.392 × 10^7 | 2.261 × 10^4 |
| | Max | 2.031 × 10^11 | 8.030 × 10^5 | 1.507 × 10^9 | 3.799 × 10^9 | 1.485 × 10^6 |
| | Std | 3.719 × 10^10 | 1.616 × 10^5 | 3.689 × 10^8 | 8.210 × 10^8 | 3.465 × 10^5 |
| | Rank | 5 | 1 | 4 | 3 | 2 |
| F13 | Mean | 1.224 × 10^10 | 1.733 × 10^4 | 1.497 × 10^8 | 4.750 × 10^7 | 1.410 × 10^4 |
| | Min | 2.027 × 10^8 | 3.680 × 10^3 | 3.990 × 10^4 | 4.008 × 10^4 | 1.416 × 10^3 |
| | Max | 2.020 × 10^11 | 4.611 × 10^4 | 1.498 × 10^9 | 1.405 × 10^9 | 4.460 × 10^4 |
| | Std | 3.616 × 10^10 | 1.109 × 10^4 | 4.094 × 10^8 | 2.564 × 10^8 | 1.138 × 10^4 |
| | Rank | 5 | 2 | 4 | 3 | 1 |
| F14 | Mean | 4.157 × 10^6 | 1.219 × 10^4 | 4.015 × 10^5 | 1.794 × 10^5 | 4.376 × 10^4 |
| | Min | 1.956 × 10^4 | 1.746 × 10^3 | 2.666 × 10^4 | 2.325 × 10^3 | 4.285 × 10^3 |
| | Max | 4.485 × 10^7 | 1.370 × 10^5 | 1.337 × 10^6 | 9.272 × 10^5 | 2.532 × 10^5 |
| | Std | 8.338 × 10^6 | 2.417 × 10^4 | 4.196 × 10^5 | 2.868 × 10^5 | 5.106 × 10^4 |
| | Rank | 5 | 1 | 4 | 3 | 2 |
| F15 | Mean | 1.424 × 10^9 | 8.116 × 10^3 | 4.410 × 10^6 | 4.594 × 10^6 | 3.994 × 10^3 |
| | Min | 2.520 × 10^5 | 1.731 × 10^3 | 2.667 × 10^4 | 1.355 × 10^4 | 1.607 × 10^3 |
| | Max | 1.273 × 10^10 | 3.219 × 10^4 | 3.595 × 10^7 | 9.453 × 10^7 | 2.100 × 10^4 |
| | Std | 2.717 × 10^9 | 8.011 × 10^3 | 9.902 × 10^6 | 1.740 × 10^7 | 4.244 × 10^3 |
| | Rank | 5 | 2 | 4 | 3 | 1 |
| F16 | Mean | 5.003 × 10^3 | 2.990 × 10^3 | 2.559 × 10^3 | 2.455 × 10^3 | 2.489 × 10^3 |
| | Min | 3.196 × 10^3 | 2.365 × 10^3 | 2.069 × 10^3 | 2.033 × 10^3 | 1.700 × 10^3 |
| | Max | 1.215 × 10^4 | 3.871 × 10^3 | 3.254 × 10^3 | 3.319 × 10^3 | 3.017 × 10^3 |
| | Std | 2.024 × 10^3 | 3.759 × 10^2 | 3.002 × 10^2 | 3.224 × 10^2 | 3.925 × 10^2 |
| | Rank | 5 | 4 | 3 | 1 | 2 |
| F17 | Mean | 5.052 × 10^3 | 3.017 × 10^3 | 2.460 × 10^3 | 2.392 × 10^3 | 2.126 × 10^3 |
| | Min | 3.917 × 10^3 | 2.269 × 10^3 | 2.102 × 10^3 | 1.990 × 10^3 | 1.612 × 10^3 |
| | Max | 1.022 × 10^4 | 3.507 × 10^3 | 3.213 × 10^3 | 3.419 × 10^3 | 3.014 × 10^3 |
| | Std | 1.165 × 10^3 | 3.432 × 10^2 | 2.835 × 10^2 | 3.141 × 10^2 | 3.478 × 10^2 |
| | Rank | 5 | 4 | 3 | 2 | 1 |
| F18 | Mean | 1.149 × 10^8 | 7.835 × 10^5 | 1.858 × 10^6 | 1.463 × 10^6 | 1.632 × 10^5 |
| | Min | 2.326 × 10^6 | 8.942 × 10^4 | 5.083 × 10^4 | 8.757 × 10^4 | 4.111 × 10^4 |
| | Max | 7.292 × 10^8 | 2.771 × 10^6 | 2.160 × 10^7 | 8.673 × 10^6 | 4.130 × 10^5 |
| | Std | 2.234 × 10^8 | 6.988 × 10^5 | 4.064 × 10^6 | 1.773 × 10^6 | 8.817 × 10^4 |
| | Rank | 5 | 4 | 3 | 2 | 1 |
| F19 | Mean | 1.516 × 10^9 | 9.880 × 10^3 | 1.345 × 10^7 | 2.830 × 10^6 | 7.321 × 10^3 |
| | Min | 8.062 × 10^6 | 1.979 × 10^3 | 3.550 × 10^4 | 6.985 × 10^3 | 1.991 × 10^3 |
| | Max | 2.216 × 10^10 | 4.385 × 10^4 | 3.380 × 10^8 | 1.373 × 10^7 | 3.052 × 10^4 |
| | Std | 4.402 × 10^9 | 9.797 × 10^3 | 6.135 × 10^7 | 3.282 × 10^6 | 7.050 × 10^3 |
| | Rank | 5 | 2 | 4 | 3 | 1 |
| F20 | Mean | 9.231 × 10^3 | 6.463 × 10^3 | 6.165 × 10^3 | 5.747 × 10^3 | 4.678 × 10^3 |
| | Min | 5.555 × 10^3 | 2.300 × 10^3 | 4.389 × 10^3 | 2.475 × 10^3 | 2.300 × 10^3 |
| | Max | 1.129 × 10^4 | 9.968 × 10^3 | 1.015 × 10^4 | 9.513 × 10^3 | 9.283 × 10^3 |
| | Std | 1.071 × 10^3 | 1.958 × 10^3 | 1.440 × 10^3 | 1.854 × 10^3 | 3.017 × 10^3 |
| | Rank | 5 | 4 | 3 | 2 | 1 |
| F21 | Mean | 2.697 × 10^3 | 2.520 × 10^3 | 2.408 × 10^3 | 2.388 × 10^3 | 2.403 × 10^3 |
| | Min | 2.583 × 10^3 | 2.411 × 10^3 | 2.373 × 10^3 | 2.341 × 10^3 | 2.319 × 10^3 |
| | Max | 2.838 × 10^3 | 2.650 × 10^3 | 2.523 × 10^3 | 2.436 × 10^3 | 2.469 × 10^3 |
| | Std | 5.997 × 10^1 | 6.193 × 10^1 | 2.773 × 10^1 | 2.040 × 10^1 | 3.968 × 10^1 |
| | Rank | 5 | 4 | 2 | 1 | 3 |
| F22 | Mean | 9.247 × 10^3 | 6.023 × 10^3 | 6.002 × 10^3 | 6.264 × 10^3 | 3.598 × 10^3 |
| | Min | 6.356 × 10^3 | 2.300 × 10^3 | 2.731 × 10^3 | 2.672 × 10^3 | 2.300 × 10^3 |
| | Max | 1.083 × 10^4 | 1.204 × 10^4 | 1.041 × 10^4 | 9.926 × 10^3 | 9.689 × 10^3 |
| | Std | 1.019 × 10^3 | 2.548 × 10^3 | 1.493 × 10^3 | 2.164 × 10^3 | 2.650 × 10^3 |
| | Rank | 5 | 4 | 2 | 3 | 1 |
| F23 | Mean | 3.363 × 10^3 | 3.479 × 10^3 | 2.792 × 10^3 | 2.757 × 10^3 | 2.708 × 10^3 |
| | Min | 3.098 × 10^3 | 3.095 × 10^3 | 2.726 × 10^3 | 2.701 × 10^3 | 2.674 × 10^3 |
| | Max | 3.869 × 10^3 | 3.882 × 10^3 | 2.946 × 10^3 | 2.895 × 10^3 | 2.767 × 10^3 |
| | Std | 1.838 × 10^2 | 1.772 × 10^2 | 5.363 × 10^1 | 3.990 × 10^1 | 2.849 × 10^1 |
| | Rank | 4 | 5 | 3 | 2 | 1 |
| F24 | Mean | 3.533 × 10^3 | 3.531 × 10^3 | 2.993 × 10^3 | 2.947 × 10^3 | 2.970 × 10^3 |
| | Min | 3.281 × 10^3 | 3.233 × 10^3 | 2.899 × 10^3 | 2.881 × 10^3 | 2.865 × 10^3 |
| | Max | 3.856 × 10^3 | 3.882 × 10^3 | 3.095 × 10^3 | 3.092 × 10^3 | 3.011 × 10^3 |
| | Std | 1.543 × 10^2 | 1.353 × 10^2 | 5.745 × 10^1 | 5.943 × 10^1 | 3.716 × 10^1 |
| | Rank | 4 | 5 | 3 | 1 | 2 |
| F25 | Mean | 4.003 × 10^3 | 2.915 × 10^3 | 3.005 × 10^3 | 2.960 × 10^3 | 2.888 × 10^3 |
| | Min | 3.581 × 10^3 | 2.884 × 10^3 | 2.930 × 10^3 | 2.906 × 10^3 | 2.883 × 10^3 |
| | Max | 5.042 × 10^3 | 2.948 × 10^3 | 3.220 × 10^3 | 3.044 × 10^3 | 2.910 × 10^3 |
| | Std | 3.285 × 10^2 | 2.514 × 10^1 | 7.924 × 10^1 | 3.339 × 10^1 | 4.344 × 10^0 |
| | Rank | 5 | 2 | 4 | 3 | 1 |
| F26 | Mean | 1.005 × 10^4 | 6.454 × 10^3 | 4.605 × 10^3 | 4.554 × 10^3 | 4.105 × 10^3 |
| | Min | 7.873 × 10^3 | 2.800 × 10^3 | 4.114 × 10^3 | 4.051 × 10^3 | 3.698 × 10^3 |
| | Max | 1.259 × 10^4 | 1.021 × 10^4 | 5.272 × 10^3 | 5.723 × 10^3 | 4.894 × 10^3 |
| | Std | 1.211 × 10^3 | 2.659 × 10^3 | 3.458 × 10^2 | 3.358 × 10^2 | 2.572 × 10^2 |
| | Rank | 5 | 4 | 2 | 3 | 1 |
| F27 | Mean | 3.833 × 10^3 | 4.218 × 10^3 | 3.200 × 10^3 | 3.200 × 10^3 | 3.211 × 10^3 |
| | Min | 3.410 × 10^3 | 3.644 × 10^3 | 3.200 × 10^3 | 3.200 × 10^3 | 3.201 × 10^3 |
| | Max | 5.619 × 10^3 | 4.903 × 10^3 | 3.200 × 10^3 | 3.200 × 10^3 | 3.221 × 10^3 |
| | Std | 4.177 × 10^2 | 3.348 × 10^2 | 2.205 × 10^−4 | 3.106 × 10^−4 | 5.428 × 10^0 |
| | Rank | 4 | 5 | 2 | 1 | 3 |
| F28 | Mean | 5.399 × 10^3 | 3.157 × 10^3 | 3.315 × 10^3 | 3.317 × 10^3 | 3.210 × 10^3 |
| | Min | 4.195 × 10^3 | 3.100 × 10^3 | 3.296 × 10^3 | 3.296 × 10^3 | 3.100 × 10^3 |
| | Max | 6.761 × 10^3 | 3.265 × 10^3 | 3.474 × 10^3 | 3.465 × 10^3 | 3.267 × 10^3 |
| | Std | 6.843 × 10^2 | 6.460 × 10^1 | 4.611 × 10^1 | 4.549 × 10^1 | 3.247 × 10^1 |
| | Rank | 5 | 1 | 3 | 4 | 2 |
| F29 | Mean | 3.507 × 10^3 | 3.498 × 10^3 | 2.987 × 10^3 | 2.963 × 10^3 | 2.953 × 10^3 |
| | Min | 3.197 × 10^3 | 3.312 × 10^3 | 2.878 × 10^3 | 2.877 × 10^3 | 2.860 × 10^3 |
| | Max | 3.771 × 10^3 | 3.695 × 10^3 | 3.111 × 10^3 | 3.074 × 10^3 | 2.992 × 10^3 |
| | Std | 1.371 × 10^2 | 9.597 × 10^1 | 6.437 × 10^1 | 6.320 × 10^1 | 3.351 × 10^1 |
| | Rank | 4 | 5 | 3 | 2 | 1 |
| F30 | Mean | 4.622 × 10^9 | 1.995 × 10^4 | 1.353 × 10^7 | 3.037 × 10^7 | 8.942 × 10^3 |
| | Min | 6.778 × 10^7 | 7.843 × 10^3 | 3.555 × 10^4 | 1.607 × 10^4 | 5.375 × 10^3 |
| | Max | 7.022 × 10^10 | 4.219 × 10^4 | 2.789 × 10^8 | 3.440 × 10^8 | 1.593 × 10^4 |
| | Std | 1.505 × 10^10 | 7.370 × 10^3 | 5.074 × 10^7 | 7.612 × 10^7 | 3.023 × 10^3 |
| | Rank | 5 | 2 | 3 | 4 | 1 |
| All | Mean rank | 4.883 | 3.183 | 3.117 | 2.316 | 1.500 |
| All | Final rank | 5 | 4 | 3 | 2 | 1 |
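The "Rank", "Mean rank" and "Final rank" rows summarise Table 2 in the usual way for metaheuristic comparisons: each algorithm is ranked per function (1 = smallest mean objective value) and the per-function ranks are averaged over the 30 functions. The snippet below is a minimal sketch of that bookkeeping; the two data rows are the F1 and F2 "Mean" rows copied from the table, and the ascending-by-mean ranking rule is our reading of the table rather than a published procedure.

```python
import numpy as np

algorithms = ["PSO", "WOA", "GWO", "CSO", "CS-GWO"]
# "Mean" rows of Table 2 (one row per function, one column per algorithm).
means = np.array([
    [3.169e11, 2.408e3, 3.781e10, 2.275e10, 4.875e3],  # F1
    [8.880e2,  7.076e2, 6.260e2,  6.012e2,  5.550e2],  # F2
])

# Rank within each function: 1 = best (smallest mean), 5 = worst.
ranks = means.argsort(axis=1).argsort(axis=1) + 1   # F1 -> [5 1 4 3 2], F2 -> [5 4 3 2 1]
mean_rank = ranks.mean(axis=0)                      # averaged over all listed functions
for name, r in sorted(zip(algorithms, mean_rank), key=lambda p: p[1]):
    print(f"{name}: mean rank {r:.3f}")
```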
Table 3. The paired samples t-test results.

| Functions | t-Value (vs. PSO) | Sig. (2-Tailed) | t-Value (vs. WOA) | Sig. (2-Tailed) | t-Value (vs. GWO) | Sig. (2-Tailed) | t-Value (vs. CSO) | Sig. (2-Tailed) |
|---|---|---|---|---|---|---|---|---|
| F1 | −1.898 × 10^1 | 6.748 × 10^−18 | 2.413 × 10^0 | 2.236 × 10^−2 | −8.712 × 10^0 | 1.367 × 10^−9 | −9.427 × 10^0 | 2.472 × 10^−10 |
| F2 | −3.355 × 10^1 | 9.349 × 10^−25 | −1.603 × 10^1 | 6.012 × 10^−16 | −7.118 × 10^0 | 7.828 × 10^−8 | −7.771 × 10^0 | 1.434 × 10^−8 |
| F3 | −3.733 × 10^1 | 4.521 × 10^−26 | −1.447 × 10^1 | 8.498 × 10^−15 | −2.535 × 10^0 | 1.688 × 10^−2 | −4.732 × 10^0 | 5.343 × 10^−5 |
| F4 | −1.361 × 10^1 | 4.016 × 10^−14 | 5.103 × 10^0 | 1.907 × 10^−5 | −6.128 × 10^0 | 1.122 × 10^−6 | −5.836 × 10^0 | 2.499 × 10^−6 |
| F5 | −2.913 × 10^1 | 5.005 × 10^−23 | −1.207 × 10^1 | 7.880 × 10^−13 | −2.016 × 10^0 | 5.320 × 10^−2 | −4.119 × 10^0 | 2.892 × 10^−4 |
| F6 | −4.482 × 10^1 | 2.469 × 10^−28 | −2.925 × 10^1 | 4.471 × 10^−23 | −1.638 × 10^1 | 3.381 × 10^−16 | −1.711 × 10^1 | 1.078 × 10^−16 |
| F7 | −4.000 × 10^1 | 6.363 × 10^−27 | −1.717 × 10^1 | 9.852 × 10^−17 | −1.211 × 10^0 | 2.358 × 10^−1 | −1.954 × 10^0 | 6.035 × 10^−2 |
| F8 | −1.977 × 10^1 | 2.240 × 10^−18 | −6.441 × 10^0 | 4.791 × 10^−7 | −2.143 × 10^0 | 4.061 × 10^−2 | −3.119 × 10^0 | 4.073 × 10^−3 |
| F9 | −2.915 × 10^1 | 4.917 × 10^−23 | −1.070 × 10^1 | 1.390 × 10^−11 | −8.638 × 10^0 | 1.639 × 10^−9 | −9.730 × 10^0 | 1.224 × 10^−10 |
| F10 | −4.268 × 10^0 | 1.926 × 10^−4 | 3.982 × 10^0 | 4.199 × 10^−4 | 1.263 × 10^1 | 2.564 × 10^−13 | 7.920 × 10^0 | 9.815 × 10^−9 |
| F11 | −1.744 × 10^1 | 6.491 × 10^−17 | −6.545 × 10^0 | 3.613 × 10^−7 | −4.355 × 10^0 | 1.517 × 10^−4 | −5.824 × 10^0 | 2.584 × 10^−6 |
| F12 | −3.827 × 10^0 | 6.381 × 10^−4 | 2.057 × 10^0 | 4.877 × 10^−2 | −3.849 × 10^0 | 6.013 × 10^−4 | −5.683 × 10^0 | 3.811 × 10^−6 |
| F13 | −1.855 × 10^0 | 7.384 × 10^−2 | −1.045 × 10^0 | 3.045 × 10^−1 | −1.014 × 10^0 | 3.188 × 10^−1 | −2.002 × 10^0 | 5.470 × 10^−2 |
| F14 | −2.698 × 10^0 | 1.150 × 10^−2 | 2.856 × 10^0 | 7.859 × 10^−3 | −2.508 × 10^0 | 1.800 × 10^−2 | −4.504 × 10^0 | 1.004 × 10^−4 |
| F15 | −2.870 × 10^0 | 7.579 × 10^−3 | −2.481 × 10^0 | 1.915 × 10^−2 | −1.445 × 10^0 | 1.592 × 10^−1 | −2.437 × 10^0 | 2.118 × 10^−2 |
| F16 | −6.619 × 10^0 | 2.964 × 10^−7 | −5.076 × 10^0 | 2.056 × 10^−5 | −3.658 × 10^−1 | 7.172 × 10^−1 | −1.120 × 10^0 | 2.720 × 10^−1 |
| F17 | −1.250 × 10^1 | 3.371 × 10^−13 | −9.608 × 10^0 | 1.622 × 10^−10 | −2.962 × 10^0 | 6.043 × 10^−3 | −4.194 × 10^0 | 2.354 × 10^−4 |
| F18 | −1.801 × 10^1 | 2.752 × 10^−17 | −1.777 × 10^1 | 3.930 × 10^−17 | −4.058 × 10^0 | 3.412 × 10^−4 | −4.287 × 10^0 | 1.828 × 10^−4 |
| F19 | −1.886 × 10^0 | 6.940 × 10^−2 | −1.182 × 10^0 | 2.468 × 10^−1 | −4.709 × 10^0 | 5.685 × 10^−5 | −1.200 × 10^0 | 2.399 × 10^−1 |
| F20 | −7.822 × 10^0 | 1.258 × 10^−8 | −2.627 × 10^0 | 1.363 × 10^−2 | −1.677 × 10^0 | 1.042 × 10^−1 | −2.327 × 10^0 | 2.715 × 10^−2 |
| F21 | −2.256 × 10^1 | 6.096 × 10^−20 | −8.681 × 10^0 | 1.476 × 10^−9 | 1.939 × 10^0 | 6.226 × 10^−2 | −6.014 × 10^−1 | 5.523 × 10^−1 |
| F22 | −1.092 × 10^1 | 8.593 × 10^−12 | −3.382 × 10^0 | 2.076 × 10^−3 | −4.182 × 10^0 | 2.434 × 10^−4 | −4.385 × 10^0 | 1.395 × 10^−4 |
| F23 | −1.887 × 10^1 | 7.868 × 10^−18 | −2.395 × 10^1 | 1.174 × 10^−20 | −5.113 × 10^0 | 1.853 × 10^−5 | −7.663 × 10^0 | 1.895 × 10^−8 |
| F24 | −1.967 × 10^1 | 2.579 × 10^−18 | −2.128 × 10^1 | 3.048 × 10^−19 | 2.003 × 10^0 | 5.460 × 10^−2 | −2.219 × 10^0 | 3.446 × 10^−2 |
| F25 | −1.859 × 10^1 | 1.176 × 10^−17 | −5.979 × 10^0 | 1.688 × 10^−6 | −1.154 × 10^1 | 2.322 × 10^−12 | −8.043 × 10^0 | 7.194 × 10^−9 |
| F26 | −2.662 × 10^1 | 6.231 × 10^−22 | −4.748 × 10^0 | 5.111 × 10^−5 | −6.544 × 10^0 | 3.618 × 10^−7 | −6.443 × 10^0 | 4.764 × 10^−7 |
| F27 | −8.182 × 10^0 | 5.069 × 10^−9 | −1.640 × 10^1 | 3.285 × 10^−16 | 1.129 × 10^1 | 3.900 × 10^−12 | 1.129 × 10^1 | 3.902 × 10^−12 |
| F28 | −1.764 × 10^1 | 4.810 × 10^−17 | 4.099 × 10^0 | 3.058 × 10^−4 | −1.060 × 10^1 | 1.737 × 10^−11 | −9.310 × 10^0 | 3.254 × 10^−10 |
| F29 | −2.047 × 10^1 | 8.731 × 10^−19 | −2.666 × 10^1 | 6.021 × 10^−22 | −6.827 × 10^−1 | 5.002 × 10^−1 | −2.368 × 10^0 | 2.476 × 10^−2 |
| F30 | −1.682 × 10^0 | 1.033 × 10^−1 | −7.369 × 10^0 | 4.050 × 10^−8 | −2.184 × 10^0 | 3.718 × 10^−2 | −1.460 × 10^0 | 1.550 × 10^−1 |
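Table 3 applies the paired-samples t-test [52] to the matched runs of each algorithm pair: a negative t-value favours CS-GWO (lower objective values on a minimization problem), and a Sig. (2-tailed) value below the usual 0.05 threshold marks the difference as significant. Since the individual run results behind the table are not reproduced in the paper, the sketch below uses synthetic stand-in data purely to illustrate the computation with scipy.stats.ttest_rel.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical stand-ins for 30 paired final objective values on one function.
cs_gwo_runs = rng.normal(555.0, 34.0, size=30)
gwo_runs = rng.normal(626.0, 32.0, size=30)

# Paired-samples t-test on the run-wise differences.
t_value, sig_2_tailed = stats.ttest_rel(cs_gwo_runs, gwo_runs)
print(f"t-Value = {t_value:.3f}, Sig. (2-tailed) = {sig_2_tailed:.3e}")
```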
Table 4. Wilcoxon signed-rank test results of CEC 2017.

| Functions | p-Value (vs. PSO) | R+ | R− | Winner | p-Value (vs. WOA) | R+ | R− | Winner |
|---|---|---|---|---|---|---|---|---|
| F1 | 1.734 × 10^−6 | 0 | 465 | + | 6.564 × 10^−2 | 322 | 143 | = |
| F2 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F3 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F4 | 1.734 × 10^−6 | 0 | 465 | + | 2.163 × 10^−5 | 439 | 26 | − |
| F5 | 1.734 × 10^−6 | 0 | 465 | + | 1.921 × 10^−6 | 1 | 464 | + |
| F6 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F7 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F8 | 1.734 × 10^−6 | 0 | 465 | + | 2.843 × 10^−5 | 29 | 436 | + |
| F9 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F10 | 5.287 × 10^−4 | 64 | 401 | + | 9.627 × 10^−4 | 393 | 72 | − |
| F11 | 1.734 × 10^−6 | 0 | 465 | + | 1.238 × 10^−5 | 20 | 445 | + |
| F12 | 1.734 × 10^−6 | 0 | 465 | + | 8.972 × 10^−2 | 315 | 150 | + |
| F13 | 1.734 × 10^−6 | 0 | 465 | + | 1.589 × 10^−1 | 164 | 301 | = |
| F14 | 3.182 × 10^−6 | 6 | 459 | + | 1.150 × 10^−4 | 420 | 45 | − |
| F15 | 1.734 × 10^−6 | 0 | 465 | + | 3.609 × 10^−3 | 91 | 374 | + |
| F16 | 1.734 × 10^−6 | 0 | 465 | + | 5.307 × 10^−5 | 36 | 429 | + |
| F17 | 1.734 × 10^−6 | 0 | 465 | + | 2.879 × 10^−6 | 5 | 460 | + |
| F18 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F19 | 1.734 × 10^−6 | 0 | 465 | + | 1.470 × 10^−1 | 162 | 303 | = |
| F20 | 5.216 × 10^−6 | 11 | 454 | + | 1.480 × 10^−2 | 114 | 351 | + |
| F21 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F22 | 2.353 × 10^−6 | 3 | 462 | + | 3.379 × 10^−3 | 90 | 375 | + |
| F23 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F24 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F25 | 1.734 × 10^−6 | 0 | 465 | + | 1.359 × 10^−4 | 47 | 418 | + |
| F26 | 1.734 × 10^−6 | 0 | 465 | + | 2.613 × 10^−4 | 55 | 410 | + |
| F27 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F28 | 1.734 × 10^−6 | 0 | 465 | + | 6.639 × 10^−4 | 398 | 67 | − |
| F29 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F30 | 1.734 × 10^−6 | 0 | 465 | + | 4.286 × 10^−6 | 9 | 456 | + |
| +/=/− | 30/0/0 | | | | 23/3/4 | | | |

| Functions | p-Value (vs. GWO) | R+ | R− | Winner | p-Value (vs. CSO) | R+ | R− | Winner |
|---|---|---|---|---|---|---|---|---|
| F1 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F2 | 3.182 × 10^−6 | 6 | 459 | + | 1.127 × 10^−5 | 19 | 446 | + |
| F3 | 1.359 × 10^−4 | 47 | 418 | + | 3.327 × 10^−2 | 129 | 336 | + |
| F4 | 1.921 × 10^−6 | 1 | 464 | + | 3.882 × 10^−6 | 8 | 457 | + |
| F5 | 4.196 × 10^−4 | 61 | 404 | + | 7.190 × 10^−2 | 145 | 320 | = |
| F6 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F7 | 1.470 × 10^−1 | 162 | 303 | = | 4.779 × 10^−1 | 198 | 267 | = |
| F8 | 8.217 × 10^−3 | 104 | 361 | + | 3.872 × 10^−2 | 132 | 333 | + |
| F9 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F10 | 1.238 × 10^−5 | 445 | 20 | − | 2.879 × 10^−6 | 460 | 5 | − |
| F11 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F12 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F13 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F14 | 6.892 × 10^−5 | 39 | 426 | + | 7.865 × 10^−2 | 147 | 318 | = |
| F15 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F16 | 3.709 × 10^−1 | 189 | 276 | = | 7.813 × 10^−1 | 219 | 246 | = |
| F17 | 3.065 × 10^−4 | 57 | 408 | + | 1.319 × 10^−2 | 112 | 353 | + |
| F18 | 6.156 × 10^−4 | 66 | 399 | + | 3.589 × 10^−4 | 59 | 406 | + |
| F19 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F20 | 2.564 × 10^−2 | 124 | 341 | + | 1.986 × 10^−1 | 170 | 295 | = |
| F21 | 7.971 × 10^−1 | 245 | 220 | = | 4.950 × 10^−2 | 328 | 137 | − |
| F22 | 6.156 × 10^−4 | 66 | 399 | + | 1.484 × 10^−3 | 78 | 387 | + |
| F23 | 2.879 × 10^−6 | 5 | 460 | + | 3.405 × 10^−5 | 31 | 434 | + |
| F24 | 5.984 × 10^−2 | 141 | 324 | = | 3.872 × 10^−2 | 333 | 132 | − |
| F25 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F26 | 3.112 × 10^−5 | 30 | 435 | + | 4.286 × 10^−6 | 9 | 456 | + |
| F27 | 1.734 × 10^−6 | 465 | 0 | − | 1.734 × 10^−6 | 465 | 0 | − |
| F28 | 1.734 × 10^−6 | 0 | 465 | + | 1.734 × 10^−6 | 0 | 465 | + |
| F29 | 7.190 × 10^−2 | 145 | 320 | = | 8.130 × 10^−1 | 221 | 244 | = |
| F30 | 1.734 × 10^−6 | 465 | 0 | − | 1.734 × 10^−6 | 465 | 0 | − |
| +/=/− | 22/5/3 | | | | 19/6/5 | | | |
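Table 4 reports the Wilcoxon signed-rank test [53] in the style recommended by Derrac et al. [48]: R+ and R− are the rank sums of the positive and negative run-wise differences, so with 30 runs they always add up to 30 × 31 / 2 = 465, and the Winner column appears to mark "+" when CS-GWO is significantly better, "=" when the difference is not significant, and "−" when CS-GWO is significantly worse. The sketch below, again on synthetic stand-in data since the raw runs are not published, shows how the p-value and the two rank sums can be computed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical stand-ins for 30 paired final objective values on one function.
cs_gwo_runs = rng.normal(555.0, 34.0, size=30)
cso_runs = rng.normal(601.0, 20.0, size=30)

# Two-sided Wilcoxon signed-rank test on the paired differences.
result = stats.wilcoxon(cs_gwo_runs, cso_runs)

# Rank sums of positive/negative differences; R+ + R- = 465 for 30 runs.
d = cs_gwo_runs - cso_runs
ranks = stats.rankdata(np.abs(d))
r_plus, r_minus = ranks[d > 0].sum(), ranks[d < 0].sum()
print(f"p-Value = {result.pvalue:.3e}, R+ = {r_plus:.0f}, R- = {r_minus:.0f}")
```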
Table 5. The experiment results of case study 2.

| Prediction Model | RMSE/MW | MAE/MW | SMAPE/% | R² |
|---|---|---|---|---|
| DA-BiGRU | 29.053 | 22.897 | 3.566 | 0.937 |
| PSO-DA-BiGRU | 28.546 | 22.285 | 3.519 | 0.939 |
| WOA-DA-BiGRU | 28.209 | 22.221 | 3.471 | 0.941 |
| GWO-DA-BiGRU | 27.895 | 22.162 | 3.545 | 0.942 |
| CSO-DA-BiGRU | 27.194 | 21.255 | 3.347 | 0.945 |
| CS-GWO-DA-BiGRU | 26.144 | 20.963 | 3.337 | 0.949 |