1. Introduction
At present, the power system reform in China is underway, and the spot market will be implemented in pilot provinces such as Guangdong and Zhejiang [1]. Because the electricity spot market is characterized by complex trading varieties, high trading frequency, and fluctuating prices, the accuracy of ultra-short-term load forecasting is significant. It can help power market members make trading decisions in the energy market, capacity market, auxiliary service market, and demand-side response market [2]. Additionally, ultra-short-term load forecasting helps to arrange the operation mode of a power network and the maintenance plan of a unit reasonably, and can improve the economic and social benefits of a power system.
Load forecasting methods are divided into two categories: classical statistical forecasting technologies and intelligent forecasting technologies. Classical load forecasting methods mainly include the exponential moving average [3], linear regression [4], the auto-regressive integrated moving average [5,6], the dynamic regression method [7], and the generalized auto-regressive conditional heteroskedasticity approach [8]. A prediction model based on statistics has a relatively simple structure and a clear prediction principle, but its prediction accuracy is low, and it is often only applicable to cases with a small amount of data. Based on machine learning theory, an intelligent forecasting model can fit the nonlinear relationship between complex variables, thus improving the prediction effect. Common intelligent prediction methods include the support vector machine [9], neural network [10,11], random forest [12], etc. However, these methods have strict requirements on feature selection and need an experienced person to select the input features manually. In addition, these methods require highly stable sample data and take a long time to preprocess. At the same time, the shallow intelligent forecasting technologies mentioned above are not suitable for scenarios with a large amount of data. As the data dimension and training depth increase, they easily fall into local optima and overfitting; thus, the stability of prediction cannot be guaranteed. In recent years, with the development of artificial intelligence technology, a large amount of data has been accumulated in the power system, making it possible and necessary to apply intelligent methods such as deep learning for prediction [13], and such methods continue to develop [14,15,16]. In order to improve the accuracy of deep learning network models, some studies choose to increase the complexity of the models. However, as the number of training parameters increases, the model training time also increases significantly. As a result, how to build a high-quality deep learning network model has become a research focus.
The main contribution of this paper is an ultra-short-term load forecasting model in which convolution, long short-term memory (LSTM), and gated recurrent unit (GRU) deep learning algorithms are integrated to predict the load at 15-min intervals. The convolution layer is mainly used to capture the spatial characteristics of the data, while LSTM and GRU are used to mine the characteristics of the data along the time dimension. Combining them can improve the feature mining ability of the model. Taking the time point, temperature, weather condition, and historical 15-min load as inputs, the deep learning network outputs a 15-min load curve for three consecutive days in the future. In addition, suitable hyperparameters for the proposed deep learning framework were found through repeated tuning. The proposed load forecasting model can better process a large amount of historical data and extract key information, and in the forecasting process it fits the nonlinear relationship between load and the other data series well. A comparison with other models shows that the proposed model has good overall performance in terms of accuracy and training time. As a consequence, this model can properly reflect the future fluctuations of ultra-short-term load.
The rest of this paper is organized as follows: Section 2 reviews applied research on deep learning. Section 3 introduces the theory involved in the deep learning model. Section 4 introduces the data samples, experimental environments, and the methods for preprocessing the experimental data. Section 5 presents the structure of the deep learning model, as well as the hyperparameter adjustment and evaluation indicators. Section 6 compares the results of different models. Finally, Section 7 provides the conclusions and a brief discussion.
2. Literature Review
According to the time scale of the forecast, electric load forecasting can be divided into medium- and long-term load forecasting, short-term load forecasting, and ultra-short-term load forecasting. Among them, ultra-short-term load forecasting refers to load forecasting within one hour, while short-term load forecasting refers to daily and weekly load forecasting. This paper forecasts the load in units of 15 min, so it presents a framework for ultra-short-term load demand forecasting based on deep learning.
With the development of computer science and technology, deep learning has gradually penetrated into all fields, and the ability of neural networks to extract data features has improved significantly. Current ultra-short-term load forecasting methods based on deep learning include the LSTM [17,18], GRU [19], recurrent neural network (RNN), and other methods.
However, most of the methods used in the literature are based on the traditional feedforward neural network (FNN), which cannot process the related information between sequences. Some studies combine unsupervised training with supervised training for hierarchical feature learning [20]. Hierarchical autoencoders are used to learn layer by layer in order to mine deep features [21]. In addition, the convolutional neural network (CNN) has achieved good results in image recognition [22], communication signals [23], and natural language processing [24]. The biggest advantage of CNN is that it turns manually designed feature extraction into automatically generated feature extraction, which also holds great promise for ultra-short-term load forecasting. On the other hand, deep belief neural networks [25,26] and RNN [27,28] have achieved good results in wind speed prediction, photovoltaic power prediction, short-term load forecasting, and ultra-short-term load forecasting.
In theory, an RNN can capture long-distance dependence, but in practice RNNs face two challenges: gradient explosion and vanishing gradients. It is therefore difficult for a traditional RNN to learn long-term dependencies, a problem that LSTM and GRU largely overcome. The LSTM network [29] is a type of improved recurrent neural network in which the hidden unit is replaced by a gated memory cell. LSTM can realize deep memory learning of important information in historical data through state cells and three special gate structures, mitigating the gradient explosion or vanishing gradients that general RNNs may suffer during back propagation. As a result, LSTM performs well when processing and predicting time series data. At present, LSTM networks have been widely used in robot control, text recognition [30], speech recognition [31], protein homology detection, and other fields. In terms of forecasting, LSTM has also gradually attracted the attention of scholars [32,33].
Although LSTM has a strong ability to handle long-term dependencies, an LSTM network has four times as many parameters as a traditional RNN [34], which makes the model heavily parameterized. In 2014, another gating model, GRU, was proposed; it was first applied to language translation [35] and achieved long-term memory effects with fewer parameters. In recent years, GRU has gradually been applied by scholars to traffic flow forecasting [36] and energy consumption forecasting [37]. The advantages of LSTM and GRU over other models in load prediction have been verified in the literature [38,39].
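The parameter comparison above can be checked with a little arithmetic. The following is a minimal sketch, assuming the textbook formulations in which each gate or candidate block has one input weight matrix, one recurrent weight matrix, and one bias vector; the function names are hypothetical, and library implementations may add extra bias terms, so their counts can differ slightly. The input size of 4 and hidden size of 50 follow the feature count and hidden units reported in this paper.

```python
def rnn_params(input_dim: int, hidden: int) -> int:
    """Parameters of a plain RNN layer: input weights, recurrent weights, bias."""
    return hidden * (input_dim + hidden) + hidden


def lstm_params(input_dim: int, hidden: int) -> int:
    """LSTM has three gates plus a candidate cell: four RNN-sized blocks."""
    return 4 * rnn_params(input_dim, hidden)


def gru_params(input_dim: int, hidden: int) -> int:
    """GRU has two gates plus a candidate state: three RNN-sized blocks."""
    return 3 * rnn_params(input_dim, hidden)


for name, count in [
    ("RNN", rnn_params(4, 50)),
    ("GRU", gru_params(4, 50)),
    ("LSTM", lstm_params(4, 50)),
]:
    print(f"{name}: {count} trainable parameters")
```

Under these assumptions the GRU layer needs three quarters of the parameters of an LSTM layer of the same size, which is in line with the shorter training times reported for GRU-based models.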
Therefore, this paper combines the advantages of convolution, LSTM, and GRU in processing time series data, with convolution introduced to reduce overfitting. Experiments on user-side load data from a city in Northern China verify the feasibility of the deep neural network framework.
3. Theoretical Description of the Proposed Model
The deep learning structure proposed in this paper mainly contains two parts: convolution, which performs feature extraction and reduces the number of training parameters to mitigate overfitting, and LSTM and GRU, which extract features across time steps. Convolution, LSTM, and GRU are three kinds of neural networks with different architectures. This section introduces these three neural networks in more detail.
3.1. Convolution
Convolution is an operation dedicated to processing data with a grid-like structure. It has three important characteristics: sparse interactions, parameter sharing, and equivariant representations [40,41]. These three characteristics make it possible to effectively reduce the complexity of the network and the number of training parameters.
As shown in Figure 1, each neuron is connected to a local region of the input layer and computes the inner product of that region and its own weights. The convolutional layer then outputs the results of all neurons. A pooling layer is usually placed after the convolutional layer and pools its output.
Different dimensions of convolution filters are used to process different types of data. One-dimensional convolution is often used in sequence models, such as natural language processing; two-dimensional convolution is applied in the field of computer vision and image processing; and three-dimensional convolution is suitable for the medical and video-processing field. The deep learning model framework constructed in this paper uses one-dimensional convolution to process time series data related to electrical load.
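To make the one-dimensional case concrete, the following is a minimal pure-Python sketch of a convolution, a ReLU activation, and a max-pooling step applied to a short load series. The load values and the three-weight kernel are made up for illustration and are not taken from the paper; the function names are hypothetical. The same three weights are reused at every position, which is the parameter sharing described above.

```python
def conv1d(series, kernel, bias=0.0):
    """Valid-mode 1-D convolution (cross-correlation, as in deep learning layers)."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, series[i:i + k])) + bias
        for i in range(len(series) - k + 1)
    ]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# Made-up 15-min load values and a hypothetical difference-like filter.
load = [310.0, 305.0, 300.0, 320.0, 350.0, 390.0, 400.0, 395.0]
kernel = [-1.0, 0.0, 1.0]

features = relu(conv1d(load, kernel))
pooled = max_pool1d(features)
print(features)  # [0.0, 15.0, 50.0, 70.0, 50.0, 5.0]
print(pooled)    # [15.0, 70.0, 50.0]
```

Each output depends only on three neighbouring inputs (sparse interaction), and pooling halves the sequence length that the following recurrent layers have to process.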
3.2. Long Short-Term Memory
The LSTM neural network is a special recurrent neural network (RNN), which introduces weighted connections with memory and feedback functions. Compared with a conventional RNN, LSTM mitigates gradient explosion and vanishing gradients, so it can keep learning over longer time series [42]. The LSTM hidden layer structure is shown in Figure 2. The core of the LSTM is the cell state, which stores information, together with three gate structures with different functions [43]: the input gate, the forget gate, and the output gate. The memory cell has the same shape as the hidden state.
The LSTM uses two gates to control the content of the unit state C. One is the forget gate, which determines how much of the unit state at the previous moment, Ct-1, is retained in the unit state at the current moment, Ct. The other is the input gate, which determines how much of the network input at the current moment is saved to the unit state. The LSTM uses the output gate to control how much of the unit state Ct is passed to the current output value.
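In the standard LSTM formulation, the three gates are computed from the current input and the previous output:

$$i_t = \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + b_i\right)$$

$$f_t = \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + b_f\right)$$

$$o_t = \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + b_o\right)$$

Calculation of candidate memory cells:

$$\tilde{C}_t = \tanh\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right)$$

Calculation of memory cells:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

The calculation of the hidden state:

$$h_t = o_t \odot \tanh\left(C_t\right)$$

where $W_{xi}$, $W_{xf}$, $W_{xo}$ and $W_{hi}$, $W_{hf}$, $W_{ho}$ are the weight parameters, $b_i$, $b_f$, $b_o$ are the deviation (bias) parameters, $W_{xc}$, $W_{hc}$, and $b_c$ are the corresponding parameters of the candidate memory cell, $h_{t-1}$ is the output value of the network layer at the previous moment, $x_t$ is the current time input value, $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, and $i_t$, $f_t$, $o_t$ are the gate structures that control whether the memory unit needs to be updated, whether it needs to be set to 0, and whether it needs to be reflected in the activation vector.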
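The gate logic described above can be traced in code. The following is a minimal pure-Python sketch of one LSTM time step, assuming the standard formulation (sigmoid gates, a tanh candidate cell, and element-wise products); it is an illustration rather than the implementation used in the experiments. The helper `zero_params`, the parameter layout, and the input values are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def affine(x, h, w_x, w_h, b):
    """Compute W_x x + W_h h + b, with each weight matrix given as a list of rows."""
    return [
        sum(w * v for w, v in zip(row_x, x))
        + sum(w * v for w, v in zip(row_h, h))
        + bias
        for row_x, row_h, bias in zip(w_x, w_h, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps 'i', 'f', 'o', 'c' to (W_x, W_h, b)."""
    i_gate = [sigmoid(v) for v in affine(x, h_prev, *params["i"])]
    f_gate = [sigmoid(v) for v in affine(x, h_prev, *params["f"])]
    o_gate = [sigmoid(v) for v in affine(x, h_prev, *params["o"])]
    c_cand = [math.tanh(v) for v in affine(x, h_prev, *params["c"])]
    # Forget part of the old cell state, then add the gated candidate.
    c_new = [f * c + i * g for f, c, i, g in zip(f_gate, c_prev, i_gate, c_cand)]
    # The output gate decides how much of the cell state is exposed.
    h_new = [o * math.tanh(c) for o, c in zip(o_gate, c_new)]
    return h_new, c_new


def zero_params(input_dim, hidden, keys):
    """Hypothetical helper: all-zero weights and biases for each named block."""
    return {
        key: (
            [[0.0] * input_dim for _ in range(hidden)],
            [[0.0] * hidden for _ in range(hidden)],
            [0.0] * hidden,
        )
        for key in keys
    }


params = zero_params(input_dim=4, hidden=2, keys="ifoc")
x_t = [0.25, 18.0, 1.0, 320.0]  # made-up: time point, temperature, weather, load
h_t, c_t = lstm_step(x_t, h_prev=[0.0, 0.0], c_prev=[2.0, -2.0], params=params)
print(h_t, c_t)
```

With all-zero parameters every gate equals 0.5 and the candidate is 0, so the cell state is simply halved; raising the forget-gate bias makes the cell keep almost all of its previous state, which is how the LSTM carries information across many time steps.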
3.3. Gated Recurrent Unit
GRU is another kind of recurrent neural network (RNN). GRU and LSTM perform similarly in many practical cases. GRU was also proposed to address problems such as long-term memory and gradients in back propagation. Compared with LSTM, GRU can achieve comparable results and is easier to train, which can greatly improve training efficiency [44]. Therefore, GRU is preferred in many cases.
As shown in Figure 3, the structure of the GRU input and output is similar to that of a traditional RNN.
The GRU uses the update gate and the reset gate to update and reset the information. As shown in Equations (1) and (2), the structure is similar to that of the LSTM gate.
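In the standard GRU formulation, the update gate and the reset gate are:

$$z_t = \sigma\left(W_{xz} x_t + W_{hz} h_{t-1} + b_z\right) \tag{1}$$

$$r_t = \sigma\left(W_{xr} x_t + W_{hr} h_{t-1} + b_r\right) \tag{2}$$

Calculation of the candidate hidden state:

$$\tilde{h}_t = \tanh\left(W_{xh} x_t + W_{hh}\left(r_t \odot h_{t-1}\right) + b_h\right) \tag{3}$$

The calculation of the hidden state:

$$h_t = z_t \odot h_{t-1} + \left(1 - z_t\right) \odot \tilde{h}_t \tag{4}$$

where $W_{xz}$, $W_{xr}$, $W_{xh}$ and $W_{hz}$, $W_{hr}$, $W_{hh}$ are the weight parameters, $b_z$, $b_r$, $b_h$ are the deviation (bias) parameters, $h_{t-1}$ is the output value of the network layer at the previous moment, $x_t$ is the current time input value, and $z_t$ and $r_t$ are the gate structures: the update gate $z_t$ controls how much of the previous hidden state is carried over to the current moment, and the reset gate $r_t$ controls how much of the previous hidden state is used when computing the candidate hidden state.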
Compared with LSTM, the GRU has one fewer gate and fewer parameters, yet it can achieve a similar function. As a result, GRU is sometimes more practical, and its ability to learn time series has been reported to be strong [45].
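The update and reset gates described in this section can be sketched as one GRU time step. This is a minimal pure-Python illustration, assuming the common convention in which the update gate weights the previous hidden state and its complement weights the candidate state (some references swap the two roles); it is not the implementation used in the experiments. The helper `zero_params`, the parameter layout, and the input values are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def affine(x, h, w_x, w_h, b):
    """Compute W_x x + W_h h + b, with each weight matrix given as a list of rows."""
    return [
        sum(w * v for w, v in zip(row_x, x))
        + sum(w * v for w, v in zip(row_h, h))
        + bias
        for row_x, row_h, bias in zip(w_x, w_h, b)
    ]


def gru_step(x, h_prev, params):
    """One GRU time step; params maps 'z', 'r', 'h' to (W_x, W_h, b)."""
    z_gate = [sigmoid(v) for v in affine(x, h_prev, *params["z"])]
    r_gate = [sigmoid(v) for v in affine(x, h_prev, *params["r"])]
    # The reset gate scales the previous state before it enters the candidate.
    reset_h = [r * h for r, h in zip(r_gate, h_prev)]
    h_cand = [math.tanh(v) for v in affine(x, reset_h, *params["h"])]
    # The update gate interpolates between the old state and the candidate.
    return [z * h + (1.0 - z) * g for z, h, g in zip(z_gate, h_prev, h_cand)]


def zero_params(input_dim, hidden, keys):
    """Hypothetical helper: all-zero weights and biases for each named block."""
    return {
        key: (
            [[0.0] * input_dim for _ in range(hidden)],
            [[0.0] * hidden for _ in range(hidden)],
            [0.0] * hidden,
        )
        for key in keys
    }


params = zero_params(input_dim=4, hidden=2, keys="zrh")
x_t = [0.25, 18.0, 1.0, 320.0]  # made-up: time point, temperature, weather, load
h_t = gru_step(x_t, h_prev=[0.8, -0.4], params=params)
print(h_t)
```

Unlike the LSTM, the GRU keeps a single state vector and uses three parameter blocks instead of four, which is where its smaller parameter count comes from.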
5. Deep Learning Model
5.1. Deep Learning Network Prediction Framework
The deep learning framework constructed in this paper consists of two convolutional layers, one LSTM layer and one GRU layer.
As Figure 4 shows, the historical meteorological data and the load data are first pre-processed and combined, and the overall time series is then sampled. Next, the convolution filter is used to extract higher-order sample features and reduce the number of training parameters. The ReLU function [21] is used as the activation function. The LSTM layer or GRU layer is then used for time series-based modeling, and a dropout layer is introduced after each layer to reduce the risk of overfitting. Finally, the load prediction result is output by a dense layer.
The overall construction process of this deep learning model is as follows:
Step 1: Data preprocessing.
Each time point has four input features (see Table 1), and there are 105,163 training samples in total. Time step and batch size are adjustable hyperparameters, so the input data is stored in a 3-dimensional tensor of shape (batch size × time step × 4).
Step 2: Model training.
Eighty percent of the sample data is used as the training set and 20% as the test set. The processed training set is then input into the deep learning model for training. The model outputs the load for the next four consecutive 15-min intervals, i.e., a one-hour load forecast.
Step 3: Adjust the model hyperparameters.
Continue to optimize the model and compare the accuracy of models trained with different hyperparameters.
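Steps 1 and 2 can be illustrated with a sliding-window sampler. The following is a minimal pure-Python sketch, assuming each row holds the four features in the order (time point, temperature, weather condition, load) and that the 80/20 split is taken in chronological order; the paper does not state either detail, and the function names and synthetic rows are hypothetical.

```python
def make_windows(rows, time_step, horizon=4, load_index=3):
    """Pair `time_step` consecutive feature rows with the next `horizon` loads."""
    inputs, targets = [], []
    for start in range(len(rows) - time_step - horizon + 1):
        end = start + time_step
        inputs.append(rows[start:end])
        targets.append([row[load_index] for row in rows[end:end + horizon]])
    return inputs, targets


def chronological_split(inputs, targets, train_fraction=0.8):
    """Keep the earliest samples for training and the latest for testing."""
    cut = int(len(inputs) * train_fraction)
    return (inputs[:cut], targets[:cut]), (inputs[cut:], targets[cut:])


# Synthetic rows: [time point of day, temperature, weather code, load].
rows = [[i % 96, 20.0, 1, 100.0 + i] for i in range(20)]

samples, labels = make_windows(rows, time_step=8)
(train_x, train_y), (test_x, test_y) = chronological_split(samples, labels)

# Each sample has shape (time step, 4); a batch stacks them into
# (batch size, time step, 4), and each label holds four 15-min loads.
print(len(samples), len(samples[0]), len(samples[0][0]))
print(labels[0])
print(len(train_x), len(test_x))
```

Changing `time_step` only changes the second dimension of the input tensor, which is why it can be tuned as a hyperparameter without touching the feature definitions.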
5.2. Hyperparameters of Deep Learning Model
In order to obtain the optimal structure of the above deep learning model, this paper uses the vertical comparison method to adjust the number of hidden layer nodes, the time step, and the batch size of the improved RNN. When the influence of one parameter on the prediction result is analyzed, the remaining parameters are kept fixed. The parameters selected throughout the experimental process are shown in Table 4.
In this paper, mean square error (MSE) is used for error evaluation. The expression is as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

where $y_i$ represents the actual load value, $\hat{y}_i$ represents the load forecast value, and $n$ represents the number of load forecast points. The value of $n$ in this deep learning model is 4. MSE is a convenient measure of the "average error" and reflects how far the forecasts deviate from the actual data. The smaller the MSE, the more accurately the prediction model describes the experimental data.
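As a worked example of the MSE definition, the following minimal Python sketch evaluates one four-point forecast, matching the n = 4 output of the model. The load values are made up for illustration, and the function name is hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between actual and forecast loads."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)


actual = [100.0, 110.0, 120.0, 130.0]     # made-up actual loads for one hour
predicted = [102.0, 108.0, 121.0, 127.0]  # made-up forecasts for the same hour

# Squared errors are 4, 4, 1, and 9, so the mean is 18 / 4.
print(mean_squared_error(actual, predicted))  # 4.5
```

Because the errors are squared, a single large miss is penalized more heavily than several small ones.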
According to Figure 5, training runs for 5 epochs, and each training run has basically converged by the second epoch. The trend in the figure shows that the overall error of the model decreases and is already within the acceptable range.
In the final experimental scenario the model shows a tendency to overfit, so the model training is stopped, and the optimal model parameters obtained are shown in Table 5.
5.3. Evaluation Index
In order to test the prediction effect of the model, it is necessary to select appropriate evaluation criteria. This paper uses the coefficient of determination, denoted by $R^2$, and the expression is as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

where $y_i$ represents the actual load value, $\hat{y}_i$ represents the load forecast value, $\bar{y}$ represents the actual load average, and $n$ represents the number of load forecast points. $R^2$ is a widely used measure of goodness of fit in regression and usually indicates the quality of the model. It normally ranges between 0 and 1; the closer $R^2$ is to 1, the higher the goodness of fit.
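The coefficient of determination can be computed directly from its definition as one minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal Python sketch with made-up load values; the function name is hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    if ss_tot == 0:
        raise ValueError("actual values are constant, so R^2 is undefined")
    return 1.0 - ss_res / ss_tot


actual = [100.0, 110.0, 120.0, 130.0]     # made-up actual loads
predicted = [102.0, 108.0, 121.0, 127.0]  # made-up forecasts

# SS_res = 18 and SS_tot = 500, so the result is close to 0.964.
print(r_squared(actual, predicted))
```

A forecast that always returns the mean of the actual values scores 0, and a forecast worse than that scores below 0, so values near 1 indicate that the model explains most of the variation in the load.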
6. Results
In order to verify the superiority of the proposed model, this section describes the model training process in detail. The model proposed in this paper is compared with four other deep learning models. The details of the models are as follows, where Model 5 is the model proposed in this paper:
Model 1 (GRU): The preprocessed data is input to the GRU layer directly, without a convolutional filter layer. The GRU layer has 50 hidden units;
Model 2 (LSTM): The preprocessed data is input to the LSTM layer directly, without a convolutional filter layer. The LSTM layer has 50 hidden units;
Model 3 (Conv-LSTM): The preprocessed data is first input to the convolutional layer for filtering, and then two LSTM layers are used for prediction. The kernel size in the convolutional layer is 4 × 4. Each LSTM layer has 50 hidden units;
Model 4 (Conv-GRU): The preprocessed data is first input to the convolutional layer for filtering, and then two GRU layers are used for prediction. The kernel size in the convolutional layer is 4 × 4. Each GRU layer has 50 hidden units;
Model 5 (Conv-GRU-LSTM): The preprocessed data is first input to the convolutional layer for filtering, and then a GRU layer and an LSTM layer are used for prediction. The kernel size in the convolutional layer is 4 × 4. The GRU and LSTM layers each have 50 hidden units.
6.1. Training Process Analysis
In order to reflect the superiority of the proposed deep learning framework, two deep learning models without a convolutional layer are compared with the three models constructed within the framework. The five models are all trained using the optimal parameters given in Table 5, and the number of epochs was set to 20. The training time and accuracy of the five models are as follows.
According to Figure 6, the introduction of convolution has a great impact on the training time of the deep neural networks, which is positively related to the number of parameters that need to be trained. Among the five models, Conv-GRU had the shortest training time and LSTM the longest. The training time of LSTM was almost five times that of Conv-LSTM, and the training time of GRU was more than three times that of Conv-GRU.
According to Figure 7, as training proceeds, both the LSTM and GRU models show a tendency to overfit, which may be due to the large number of training parameters, while the Conv-LSTM, Conv-GRU, and Conv-GRU-LSTM models become more and more stable. This is because the deep learning framework proposed in this paper can greatly reduce the number of parameters that need to be trained while maintaining the accuracy of prediction, which ultimately reduces the model training time.
6.2. Forecast Results Display
In order to further verify the superiority of the deep learning framework of this paper, the five models are used to predict the load at 288 consecutive points in the last three days of 2018. The prediction results, error, and R² are shown in Figure 8, Figure 9, and Table 6. The expression of error is as follows:
As Figure 8 shows, the five deep learning models generally have high prediction accuracy and strong stability, which supports the feasibility of applying deep learning methods to ultra-short-term load forecasting. The R² values of the five deep learning models were all greater than 0.9. Conv-LSTM had the best goodness of fit and Conv-GRU-LSTM the second best, which further supports the superiority of the deep learning framework proposed in this paper.
According to the experimental results, although the Conv-LSTM model had the highest coefficient of determination (0.9705), the training times in Figure 6 show that the Conv-GRU-LSTM model trained much faster than the Conv-LSTM model. Therefore, all things considered, the Conv-GRU-LSTM model was more practical. The advantage of the model proposed in this paper is even more significant when dealing with a large amount of sample data.
7. Conclusions and Discussion
7.1. Conclusions
With the acceleration of the power market reform process, the importance of ultra-short-term load forecasting for grid companies and emerging purchase and sale companies is becoming more apparent. At the same time, affected by many uncertain factors, the future load changes present uncertainty. In comparison with the traditional point forecasting method, the deep learning framework can actively mine the hidden information in historical data, which is conducive to the decision-making and execution of electricity purchase and sale strategies of each power trading subject, and further promotes the economics of electricity market trading.
When using large-scale data for load forecasting, the conventional prediction method always leads to an excessively complicated model and an excessive computational cost in the training process. In this paper, convolution was combined with LSTM and GRU to construct Conv-GRU-LSTM ultra-short-term load forecast models. The main research conclusions are as follows:
(1) Using power system big data, this paper collected more than 100,000 historical load records and made full use of the ability of deep neural networks to extract features automatically, which simplifies the input features and reduces the manual construction of features. The coefficient of determination of the Conv-GRU-LSTM model is 0.9639, which is very close to 1. Taking training time into account as well, the final experimental results show that the learning framework combining convolution with LSTM and GRU has an excellent feature mining ability.
(2) The model proposed in this paper is compared with the other four models including GRU, LSTM, Conv-GRU, and Conv-LSTM. The results show that the Conv-GRU-LSTM model proposed in this paper presents comprehensive advantages in training time and prediction accuracy.
(3) This paper targets ultra-short-term load forecasting at 15-min intervals. The input sample has a three-year time span and therefore covers all seasons, so the forecasting results should be robust to seasonal changes. Therefore, the model in this paper can be applied to ultra-short-term load forecasting in all periods of the year.
7.2. Discussion
Although the deep learning framework proposed in this paper can be well applied to ultra-short-term load forecasting, there is still room for improvement. Further research can be carried out in the following two aspects:
(1) The model hyperparameters can be further adjusted, such as hidden layers and number of nodes. Meanwhile, the prediction model of this paper can also be generalized to photovoltaic power generation prediction and wind power prediction through hyperparameter adjusting;
(2) The deep learning framework constructed in this paper can also be combined with multi-task learning. With reference to transfer learning and the coupling relationship between different energy sources in an integrated energy system, this model can also be used to improve the accuracy of multi-load prediction.