1. Introduction
Electric load forecasting plays an important role in the modernization of power system management and has become a research focus of power enterprises [1]. It can be divided into long-term, medium-term and short-term forecasting according to its purpose [2]. Among them, short-term load forecasting can ensure the safe and stable operation of the power system and improve social benefits [3]. Furthermore, it can accelerate the development of the power market and improve economic benefits [4]. Therefore, it is of great significance to design an efficient and accurate short-term load forecasting method.
Early short-term load forecasting methods mainly used the exponential smoothing method [5] and the hidden Markov model [6], but the ability of these methods to extract the nonlinear characteristics of the load is weak [7]. With the rapid increase in the installation of smart meters [8] and the development of artificial intelligence technology [9], short-term load forecasting based on big data analysis has become a current research hotspot, with models such as the BP neural network [10], extreme learning machine [11] and support vector machine [12]. In addition, in order to avoid falling into local minima, some scholars use swarm intelligence optimization algorithms to optimize the artificial intelligence model. For example, Niu and Dai [13] proposed a short-term load forecasting model based on modified particle swarm optimization, in which the parameters of a least squares support vector machine are optimized. The experimental results show that the proposed algorithm improves the regression accuracy and generalization ability of the model. To address the tendency of long short-term memory neural networks to fall into local minima, a whale optimization algorithm (WOA) is used to optimize the network [14]. Li et al. [15] use grey wolf optimization (GWO) to optimize the parameters of every single kernel in an extreme learning machine to improve its forecasting ability. However, GWO easily falls into local optima and may fail to find the global optimal solution [16].
Moreover, the above-mentioned shallow learning methods are applicable when only the historical load is used for forecasting. If deep hidden features must be extracted from massive load data, a deep learning model is needed. For example, Khan et al. [17] use a convolutional neural network (CNN) to extract the coupling relationships among the input features. Muzaffar and Afshari [18] use a long short-term memory (LSTM) neural network to learn the temporal correlation contained in the load time series. In addition, the gated recurrent unit (GRU), with its simpler structure and efficient feature extraction, has also been used for short-term load prediction [19]. To extract both the implicit coupling relationships between features and the temporal dependency in the load time series, combinations of CNN and LSTM [20], or improved variants (e.g., CNN-GRU [21], GRU-TCN [22] and RCNN-ML-LSTM [23]), are used in load prediction. However, the above deep learning models suffer from the vanishing and exploding gradient problems [24]. Therefore, it is of great significance to avoid these problems and accelerate the convergence of the model, so as to improve the accuracy of load forecasting.
The above-mentioned artificial-intelligence-based short-term load forecasting models do not consider the importance of the input features, so important features can fade as the step size increases [25]. Feature selection is a commonly used technique for choosing the most appropriate input features in forecasting problems [26]. Kong et al. [27] utilize principal component analysis to determine the major factors affecting wind speed, reducing the dimension of the relevant features and improving the generalization of the model. Li et al. [28] develop a feature selection method to choose competitive input features. These feature selection methods reduce the number of input features only once before prediction through simple correlation analysis, which can cause potentially important input variables to be discarded [29]. To address this problem, the attention mechanism was proposed [30], with the advantage of letting the model handle long time series dependencies more easily. For example, Wang et al. [31] proposed a short-term load forecasting model based on a feature attention mechanism (FA), in which the effective characteristics of the input variables are highlighted, leading to improved prediction accuracy. In Ref. [32], a temporal attention mechanism (TA) is applied in short-term load prediction to capture the high-impact time steps of the load sequence, further reducing the prediction errors. However, few studies combine FA with TA into a multi-stage attention mechanism that captures both the feature and temporal relationships in the load time series.
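To make the two mechanisms concrete, a minimal NumPy sketch is given below. It is an illustrative simplification, not the architecture proposed later in this paper; all weight shapes, seeds and function names are assumptions. Feature attention re-weights the input features with softmax scores, while temporal attention pools hidden states with softmax weights over time steps.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def feature_attention(x, w_f, b_f):
    # x: (T, F) input window; one softmax weight per feature at each step
    alpha = softmax(x @ w_f + b_f, axis=-1)      # (T, F) feature weights
    return alpha * x                             # re-weighted features

def temporal_attention(h, w_t, b_t):
    # h: (T, H) hidden states; one scalar score per time step
    scores = (h @ w_t + b_t).squeeze(-1)         # (T,)
    beta = softmax(scores)                       # temporal weights sum to 1
    return (beta[:, None] * h).sum(axis=0)       # (H,) context vector

rng = np.random.default_rng(0)
x = rng.normal(size=(24, 5))                     # 24 time steps, 5 features
w_f = rng.normal(size=(5, 5)); b_f = np.zeros(5)
w_t = rng.normal(size=(5, 1)); b_t = np.zeros(1)
ctx = temporal_attention(feature_attention(x, w_f, b_f), w_t, b_t)
print(ctx.shape)
```

In a full model the context vector would feed a downstream predictor; here it only demonstrates how the two weighting stages compose.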
In view of the shortcomings of existing forecasting models, this paper proposes a short-term load forecasting model (CS-GWO-DA-BiGRU) based on a dual-stage attention mechanism (DA) and the crisscross grey wolf optimizer algorithm (CS-GWO). The contributions of this paper are as follows:
Combining the advantages of a feature and temporal attention mechanism, a dual-stage attention mechanism (DA) is introduced in this paper. DA is utilized at the input side of the forecasting model to comprehensively capture the correlation relationship between various variables and temporal dependency in the load time series.
To address the deficiencies of GWO, a novel crisscross grey wolf optimizer algorithm is applied for the first time to a short-term load forecasting problem. By introducing horizontal and vertical crossover operators, the global search ability and population diversity of CS-GWO are improved.
The proposed CS-GWO-DA-BiGRU model is verified using a real load data set collected in a certain area. The experimental results show that the proposed model has higher forecasting accuracy than the comparison models and has good application prospects.
The remainder of this paper is organized as follows. Section 2 introduces the basic principles of the deep learning models involved in this paper. Section 3 presents the methodology of the proposed CS-GWO-DA-BiGRU short-term load forecasting model. Section 4 introduces the metrics used to evaluate predictions. Section 5 focuses on the details of the experiments, and the results are analyzed and discussed. Section 6 points out the limitations of this paper and indicates subsequent work. Finally, Section 7 concludes the paper.
5. Experiment and Analysis
This section verifies the effectiveness of the proposed CS-GWO-DA-BiGRU short-term load forecasting model through four evaluation indicators (RMSE, MAE, SMAPE and R2) and three case studies. In addition, in order to reduce errors caused by the experimental procedure, the two forecasting experiments (cases 1 and 3) were each performed 20 times, and the average value was taken as the final experimental result. The experiments are based on Python 3.8 and the Keras deep learning library. The core configuration of the computer used is an Intel(R) Core(TM) i5-9600K 6-core processor with a 3.70 GHz operating frequency, 8 GB of memory and the Windows 10 operating system.
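The four indicators follow their standard definitions; a minimal sketch (with SMAPE reported in percent, as is conventional) is:

```python
import numpy as np

def rmse(y, yhat):
    """Root mean squared error."""
    return np.sqrt(np.mean((y - yhat) ** 2))

def mae(y, yhat):
    """Mean absolute error."""
    return np.mean(np.abs(y - yhat))

def smape(y, yhat):
    """Symmetric mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs(y - yhat) / ((np.abs(y) + np.abs(yhat)) / 2))

def r2(y, yhat):
    """Coefficient of determination."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# toy example on four load points
y = np.array([10.0, 12.0, 11.0, 13.0])
yhat = np.array([9.5, 12.5, 11.0, 12.0])
print(rmse(y, yhat), mae(y, yhat), smape(y, yhat), r2(y, yhat))
```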
In particular, the load data used in this paper are real sample data of a region in 2018. The dataset contains 365 daily samples with a time resolution of 24 points. To reduce the influence of the data distribution on the experimental results, the data are randomly shuffled; 300 samples are selected as the training dataset, 30 samples as the validation dataset and 35 samples as the testing dataset. These datasets are depicted in Figure 6.
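The random shuffle and 300/30/35 split described above can be sketched as follows; the index array stands in for the 365 real daily load profiles, and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
samples = np.arange(365)            # placeholder for 365 daily load profiles
rng.shuffle(samples)                # random sort to reduce distribution bias

train = samples[:300]               # 300 training samples
val = samples[300:330]              # 30 validation samples
test = samples[330:]                # 35 testing samples
print(len(train), len(val), len(test))
```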
5.1. Parameter Settings
The attention mechanism is realized by a single-layer fully connected neural network with 32 neurons and a softmax activation function.
The number of neurons in the hidden layer of the prediction model based on a BP neural network is 32, the activation function is ReLU, the number of neurons in the output layer is 24 and the activation function is linear. The unit number of the models based on GRU and BiGRU is 32, the number of neurons in the output layer is 24 and the activation function is linear. The BP, GRU, BiGRU, FA-BiGRU, TA-BiGRU, and DA-BiGRU models all use the Adam optimizer, and their hyperparameters β1, β2 and ε are set to 0.9, 0.999 and 1 × 10−8, respectively. In addition, MSE is used as the loss function, and the number of iterations of these models is set to 500.
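Based on these settings, a plausible Keras declaration of the BiGRU forecaster is sketched below. This is not the paper's actual code; the 24-step, 5-feature input window shape is an assumption.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

TIME_STEPS, N_FEATURES = 24, 5      # assumed input window shape

model = keras.Sequential([
    layers.Input(shape=(TIME_STEPS, N_FEATURES)),
    layers.Bidirectional(layers.GRU(32)),       # 32 units per direction
    layers.Dense(24, activation="linear"),      # 24-point output layer
])
model.compile(
    optimizer=keras.optimizers.Adam(beta_1=0.9, beta_2=0.999, epsilon=1e-8),
    loss="mse",
)
out = model.predict(np.zeros((1, TIME_STEPS, N_FEATURES)), verbose=0)
print(out.shape)
# training would then run for the stated 500 iterations, e.g.:
# model.fit(x_train, y_train, epochs=500, validation_data=(x_val, y_val))
```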
5.2. Case 1: The Effectiveness of the BiGRU Model and Dual-Stage Mechanism
In order to better evaluate the effectiveness of the DA-BiGRU prediction model proposed in this paper, this section verifies the superiority of the BiGRU model and effectiveness of the dual-stage attention mechanism from the aspect of short-term load prediction. Persistence, BP, GRU, BiGRU, feature-attention-mechanism-based BiGRU (FA-BiGRU) and temporal-attention-mechanism-based BiGRU (TA-BiGRU) models are compared with the DA-BiGRU model in this case. The experimental results are shown in Table 1 and Figure 7, where Figure 7 compares the prediction results of the different models with the real values on 29–31 December 2018.
(1) The advantages of BiGRU model:
The prediction performance of the deep learning models is the best among the single prediction models (i.e., the persistence, BP, GRU and BiGRU models), and the prediction accuracy of the BiGRU model is the highest. For example, compared with the classic persistence baseline, the RMSE, MAE and SMAPE values of the BiGRU model are reduced by 15.46%, 14.38% and 0.942%, respectively, and the R2 value is increased by 1.67%. Compared with the shallow BP neural network, the RMSE, MAE and SMAPE values of the BiGRU model are reduced by 10.56%, 9.40% and 0.531%, respectively, and the R2 value is increased by 1.98%. In addition, compared with the GRU model, the BiGRU model achieves better RMSE, MAE, SMAPE and R2 values.
The reasons are as follows: Firstly, the machine learning model uses a large amount of historical load data for training, which can effectively capture the nonlinear relationship of load time series, and the prediction performance is improved compared with the persistence model. Secondly, the BiGRU model has a unique bidirectional propagation structure, which can link the past and future influencing factors with the current load time series so as to improve the accuracy of short-term load forecasting.
(2) The effectiveness of the dual-stage attention mechanism:
Combined with the feature attention mechanism, the model can automatically extract the correlations among features, which reduces its prediction error. Compared with the BiGRU model, the RMSE, MAE and SMAPE values of the FA-BiGRU model decreased by 4.60%, 7.92% and 7.16%, respectively, and the R2 value increased by 0.65%.
Combined with the temporal attention mechanism, the model adaptively extracts features at important moments, which improves its prediction stability. Compared with the BiGRU model, the RMSE, MAE and SMAPE values of the TA-BiGRU model decreased by 3.77%, 3.24% and 0.98%, respectively, and the R2 value increased by 0.11%.
In addition, compared with the other models in this case study, the proposed DA-BiGRU model has the best RMSE, MAE, SMAPE and R2 values. This is because the model combines the feature and temporal attention mechanisms, which improves its sensitivity to key features and key time steps, ultimately improving prediction accuracy.
5.3. Case 2: The Effectiveness of the CS-GWO Algorithm
The popular suite of benchmark functions for validating optimization performance, CEC 2017 [43], is utilized in this subsection to conduct extensive optimization experiments. The CEC 2017 test suite has 30 functions, which can be divided into four categories: unimodal functions (F1–F3), multimodal functions (F4–F10), hybrid functions (F11–F20) and composition functions (F21–F30). The ideal optimal value of each benchmark function is 0.
Moreover, the well-known optimization algorithms (i.e., PSO, WOA, GWO and CSO) are compared with CS-GWO to evaluate the effectiveness of the CS-GWO algorithm from various perspectives, including accuracy, the Wilcoxon signed-rank test and a paired samples t-test.
5.3.1. The Setting of the Numerical Experiments
The dimension of the benchmark functions is uniformly set to 30 in this subsection. For a fair comparison, the number of iterations and the number of individuals of each swarm intelligence optimization algorithm are set to 3000 and 30, respectively. The position of PSO is set in the range of [−1, 1], and the speed is limited to the range of [−0.5, 0.5] [44]. For WOA [45], GWO [46] and CS-GWO, each individual is limited to the range of [−1, 1]. The vertical crossover probability of CSO and CS-GWO is set to 60%, and the horizontal crossover probability to 100% [47]. To reduce statistical errors, all the reported results in this subsection are based on 30 independent runs.
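A minimal sketch of the two crossover operators, following the standard crisscross optimization formulas, is given below; variable names and the random seed are illustrative only. Horizontal crossover mixes two individuals dimension-wise with a small extrapolation term, while vertical crossover mixes two dimensions within one individual.

```python
import numpy as np

rng = np.random.default_rng(1)

def horizontal_crossover(x_i, x_j):
    """Arithmetic crossover between two individuals, searching the
    hypercube spanned by the pair plus an extrapolation term."""
    r1, r2 = rng.random(x_i.shape), rng.random(x_i.shape)
    c1, c2 = rng.uniform(-1, 1, x_i.shape), rng.uniform(-1, 1, x_i.shape)
    child_i = r1 * x_i + (1 - r1) * x_j + c1 * (x_i - x_j)
    child_j = r2 * x_j + (1 - r2) * x_i + c2 * (x_j - x_i)
    return child_i, child_j

def vertical_crossover(x, p_v=0.6):
    """Crossover between two dimensions of one individual; helps
    stagnant dimensions escape local optima."""
    child = x.copy()
    d1, d2 = rng.choice(x.size, size=2, replace=False)
    if rng.random() < p_v:                      # 60% vertical probability
        r = rng.random()
        child[d1] = r * x[d1] + (1 - r) * x[d2]
    return child

a, b = rng.uniform(-1, 1, 5), rng.uniform(-1, 1, 5)
ci, cj = horizontal_crossover(a, b)
cv = vertical_crossover(a)
print(ci.shape, cj.shape, cv.shape)
```

In CS-GWO these operators would be applied to the wolf population after the GWO position update, with the fitter of parent and child retained.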
5.3.2. The Comparison of Optimization Accuracy
The above-mentioned algorithms are evaluated using the CEC 2017 test suite, and the experimental results are shown in Table 2. The reported values in Table 2 are the errors between the terminal values of the optimization process and the target values of the benchmark functions. To intuitively quantify the optimization ability of the metaheuristics, the mean (Mean), minimum (Min), maximum (Max), standard deviation (Std) and rank (Rank) are used. Mean, Min and Max reveal the optimization accuracy of an algorithm, Std reveals its optimization stability and Rank, based on the Friedman test [48], ranks the optimization performance of the algorithms statistically. The minimum Mean value for each benchmark function is shown in bold.
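The Mean/Min/Max/Std summaries and a Friedman-style ranking over repeated runs can be sketched as follows; the error matrix here is random placeholder data, not the paper's results:

```python
import numpy as np

rng = np.random.default_rng(7)
algos = ["PSO", "WOA", "GWO", "CSO", "CS-GWO"]
# errors[a, r]: final error of algorithm a on run r (placeholder data)
errors = rng.exponential(scale=[[5], [4], [3], [2], [1]], size=(5, 30))

for name, e in zip(algos, errors):
    print(f"{name:7s} Mean={e.mean():.3f} Min={e.min():.3f} "
          f"Max={e.max():.3f} Std={e.std(ddof=1):.3f}")

# Friedman-style ranks: rank the algorithms on every run, then average
ranks = errors.argsort(axis=0).argsort(axis=0) + 1   # 1 = best per run
print(dict(zip(algos, ranks.mean(axis=1).round(2))))
```

The full Friedman test additionally compares the mean ranks against a chi-squared statistic; only the ranking step is shown here.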
For the unimodal and multimodal functions (i.e., F1–F10), CS-GWO achieves the best results six times, and WOA and CSO share the remaining four best results, revealing that WOA performs well on simple low-dimensional optimization problems. For the 10 hybrid functions (i.e., F11–F20), CS-GWO achieves the best performance seven times. Although WOA has the best values for F12 and F14, it obtains the second-worst values in 4 out of 10 cases, revealing that WOA is unstable when solving complex problems [49,50]. For the 10 composition functions (i.e., F21–F30), CS-GWO achieves the best performance six times, followed by CSO with three. The worst rank of CSO on the composition functions is three, on F21 and F27, indicating that CSO is able to escape local optima when applied to complex optimization problems [47].
Notably, GWO never dominates on any benchmark function, yet it ranks third overall. This reveals that GWO has stable performance in solving optimization problems but easily falls into local optima [51]. Comprehensively speaking, CS-GWO ranks first overall among the optimization algorithms and obtains the best performance on 19 out of 30 functions, in terms of both optimization accuracy and stability.
From the above analysis, we can conclude that CS-GWO performs best on the 30-dimensional optimization problems among all the compared algorithms. This is because CS-GWO combines the stable optimization performance of GWO with CSO's outstanding ability to find global optima in the solution space.
5.3.3. Wilcoxon Signed-Rank Test and Paired Samples t-Test
In order to further prove the validity of the CS-GWO algorithm, a parametric and a non-parametric test, namely the paired samples t-test (PSTT) [52] and the Wilcoxon signed-rank test (WSRT) [53], are adopted to evaluate the difference in optimization performance between CS-GWO and the comparison algorithms on the 30 benchmark functions. The null hypothesis of both PSTT and WSRT is that there is no difference between the two compared samples. If the paired differences approximately obey a normal distribution, PSTT is used; when this premise is not satisfied, WSRT is selected. The results of PSTT and WSRT are shown in Table 3 and Table 4, respectively.
Two indicators, the t-value and Sig. (2-tailed), can be obtained from PSTT. If Sig. (2-tailed) is less than 0.05, it can be concluded that the CS-GWO algorithm differs significantly from the compared algorithm. In addition, four indicators, the p-value, R+, R− and winner, can be obtained from WSRT. If the p-value is less than 0.05, the null hypothesis can be rejected at the 5% significance level. R+ is the rank sum of the cases in which the error of the CS-GWO algorithm is higher than that of the compared one, and R− is the rank sum of the cases in which it is lower. Finally, winner indicates whether the CS-GWO algorithm is superior to the compared algorithm: “+” indicates that CS-GWO is better, “−” indicates that it is worse and “=” indicates that the two algorithms display no obvious difference.
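Both tests are available in SciPy; a sketch on placeholder per-run errors of two algorithms (the data are synthetic, constructed so that one algorithm is systematically worse):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
err_cs_gwo = rng.normal(1.0, 0.2, size=30)            # placeholder errors
err_other = err_cs_gwo + rng.normal(0.5, 0.2, size=30)  # systematically worse

# Paired samples t-test: assumes the paired differences are ~normal
t_stat, sig_2tailed = stats.ttest_rel(err_cs_gwo, err_other)

# Wilcoxon signed-rank test: non-parametric alternative
w_stat, p_value = stats.wilcoxon(err_cs_gwo, err_other)

print(f"PSTT: t={t_stat:.3f}, Sig.(2-tailed)={sig_2tailed:.2e}")
print(f"WSRT: p={p_value:.2e}")
```

With 30 paired runs per function, a p-value below 0.05 rejects the no-difference null hypothesis, and the sign of the mean difference (or the R+/R− rank sums) gives the direction.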
It can be seen that most of the Sig. (2-tailed) values in Table 3 are less than 0.05, which means that there is a significant difference between the CS-GWO algorithm and the comparison algorithms, further indicating that CS-GWO is superior to the involved algorithms. In Table 4, most of the combinations (p-value, R+, R−, winner) are (1.734 × 10−6, 0, 465, +), revealing that the CS-GWO algorithm outperforms the comparison algorithms. Furthermore, the ‘+/=/−’ results are 94/14/12, indicating that the CS-GWO algorithm is better than the compared algorithms in 94 out of 120 cases.
From a statistical perspective, it can be concluded that CS-GWO dominates the other compared algorithms on the CEC 2017 test functions.
5.4. Case 3: The Effectiveness of the CS-GWO-DA-BiGRU Model
In order to verify the effectiveness of combining the crisscross grey wolf optimization algorithm with the DA-BiGRU model in short-term load forecasting, the PSO-DA-BiGRU, WOA-DA-BiGRU, GWO-DA-BiGRU and CSO-DA-BiGRU models are compared with the proposed CS-GWO-DA-BiGRU model in this case. In this subsection, the number of iterations of each swarm intelligence optimization algorithm is uniformly set to 200, and the number of individuals is set to 20.
The experimental results are shown in Table 5 and Figure 8, where Figure 8 compares the prediction results of the different models with the real values on 29–31 December 2018.
(1) The effectiveness of the swarm intelligence optimization algorithm:
The prediction models combined with swarm intelligence optimization algorithms have better prediction performance than the single prediction model (i.e., DA-BiGRU). For example, compared with the DA-BiGRU model, the PSO-DA-BiGRU and GWO-DA-BiGRU models reduce RMSE by 1.75% and 3.99%, MAE by 2.67% and 3.21% and SMAPE by 1.32% and 0.59%, respectively, and increase the R2 value by 0.21% and 0.53%, respectively. This is because the weights and biases of the DA-BiGRU model are optimized by the swarm intelligence optimization algorithm in the initial stage of training, which effectively avoids the vanishing and exploding gradient problems and further improves the accuracy of load forecasting.
(2) The superiority of the CS-GWO algorithm:
Among all the compared forecasting models, the proposed CS-GWO-DA-BiGRU short-term load forecasting model has the highest forecasting accuracy. For example, its RMSE, MAE and SMAPE are reduced by 3.86%, 1.37% and 0.30%, respectively, relative to the second-best CSO-DA-BiGRU model, and its R2 value is increased by 0.42%. Therefore, the CS-GWO algorithm, by combining the horizontal and vertical crossover operators, improves the global search ability and enhances the diversity of the population, making a great contribution to short-term load forecasting.
6. Discussion
In this paper, a high-precision model called CS-GWO-DA-BiGRU is presented for short-term load forecasting. However, the proposed model still has some shortcomings that need to be improved. The limitations and future research directions can be summarized as follows.
(1) The CS-GWO algorithm only focuses on improving the accuracy of short-term load prediction while ignoring prediction stability, leading to unstable predictions on new data. In the future, we plan to upgrade CS-GWO to a multi-objective CS-GWO algorithm to improve the accuracy and stability of short-term load prediction simultaneously.
(2) At present, the intelligent big data platform is valuable for the improvement of the prediction model. In future work, the proposed CS-GWO-DA-BiGRU prediction model will be embedded into the intelligent big data platform to construct an intelligent load forecasting system.
7. Conclusions
Short-term load prediction is essential for the stable operation and safety management of power systems. Therefore, this paper proposes a hybrid model for short-term load prediction, named CS-GWO-DA-BiGRU, which consists of a dual-stage attention mechanism, a crisscross grey wolf optimization algorithm and a bidirectional gated recurrent unit. The main contributions of this paper are summarized as follows:
(1) Different from the conventional feature attention mechanism applied in short-term load forecasting, this paper proposes a dual-stage attention mechanism by combining the feature and temporal attention mechanisms. In case 1, compared with FA-BiGRU, the RMSE, MAE and SMAPE values of the DA-BiGRU model are reduced by 1.79%, 0.74% and 0.70%, respectively, and the R2 value is increased by 0.21%. Therefore, DA can effectively capture the correlations among the input features and the temporal dependence in the load time series simultaneously.
(2) By combining the horizontal and vertical crossover operators, the global search ability and population diversity of GWO are enhanced. Based on the Friedman test in case 2, CS-GWO ranks first among the well-known algorithms and achieves the best results on 19 out of 30 CEC 2017 functions. In addition, CS-GWO outperforms the compared algorithms in 94 out of 120 cases based on the Wilcoxon signed-rank test. Furthermore, in case 3, the proposed CS-GWO-DA-BiGRU model, which is built on CS-GWO, increases the R2 value by 0.42% compared with the CSO-DA-BiGRU model and achieves the best forecasting performance.