1. Introduction
High stability, economical operation, and the delivery of high-quality power to customers are the main tasks of power systems. However, because electrical energy cannot be stored in large quantities, the five links of generation, transmission, distribution, transformation, and consumption must be completed simultaneously, so the system must follow load changes as it produces electrical energy, i.e., maintain a dynamic balance between electrical-energy production and consumption. Otherwise, the quality and economy of the electrical energy cannot be guaranteed, and the safe and stable operation of the system may even be seriously affected [1]. Load forecasting (LF) can generally be divided into four types: long-term, medium-term, short-term, and ultra-short-term, which account for 16%, 20%, 58%, and 6% of past research efforts, respectively. This shows that short-term load forecasting (STLF) is a key focus and hot topic in this field [2].
STLF has been developed through decades of research, and its methods can be divided into three main categories. The first category comprises traditional statistical methods, mainly linear regression (LR) [3], autoregression (AR) [4], and autoregressive moving average (ARMA) [5]. Statistical methods are structurally simple and easy to model, but the distribution characteristics of the input data strongly affect their output. The second category comprises machine-learning approaches, including gray systems, artificial neural networks (ANNs) [6], and support-vector machines (SVMs) [7]. The SVM algorithm can be applied to linear and nonlinear problems with a low generalization-error rate and can handle high-dimensional problems that challenge traditional algorithms, but it converges slowly and loses accuracy on large time-series datasets. A back-propagation (BP) neural network has a strong nonlinear mapping capability and can automatically extract input-output features and adjust network weights during training, but it converges slowly, is prone to falling into local minima, and requires manually specified features for time-series data, which destroys the integrity of the time series. The third category is the combined-model approach, which typically uses optimization algorithms to tune a model's hyperparameters, e.g., PSO-BP and PSO-LSTM.
Short-term power-load data are usually compound time series containing both the load's own fluctuations and related factors; they are temporal and nonlinear, and statistical methods struggle to model nonlinear time series. Although traditional machine-learning methods can overcome this obstacle, they find it difficult to preserve the time-series integrity of the input information [8]. In recent years, with improvements in computing power, deep learning, such as deep neural networks (DNNs) [9] and deep belief networks (DBNs) [10], has developed rapidly and become a hot spot in load-forecasting research [11]; DNNs and DBNs have been applied to improve prediction accuracy over traditional algorithms. A recurrent neural network (RNN) can, in principle, process time series of arbitrary length by using neurons with self-feedback to give the network short-term memory. It is usually trained with gradient descent, but gradients explode or vanish when the input sequence is long. LSTM networks introduce gating mechanisms that mitigate these problems and have been widely used in time-series processing [12,13,14]. However, the hyperparameters of an LSTM model, such as the number of iterations, the learning rate, and the number of hidden layers and their neurons, are often not set optimally, preventing the model from achieving its best prediction results. The number of iterations and the learning rate govern the training process and effectiveness of the LSTM model, while the number of hidden layers and their neurons determine its fitting capacity [15]. Hyperparameters are usually set by manual experience, which generalizes poorly and carries high uncertainty. Therefore, this paper builds a prediction model by combining algorithms.
A comprehensive analysis of solution speed, stability, and convergence accuracy in the literature [16] showed that the sparrow search algorithm (SSA) is highly competitive with the gray-wolf-optimization (GWO) algorithm, the particle-swarm-optimization (PSO) algorithm, and the gravitational-search algorithm (GSA); however, the SSA's population diversity declines late in the iteration, making it prone to falling into local extremes [17]. In view of this, this paper makes a series of improvements to the SSA, combines it with LSTM to propose an ISSA-LSTM algorithm, and applies it to STLF. Finally, the paper validates the model on load data from one region and compares it experimentally with several algorithms to confirm the superiority of the ISSA-LSTM model.
2. Long Short-Term Memory Network
LSTM is a recurrent neural-network model that improves on the basic recurrent neural network (RNN); the composition of an RNN is shown in Figure 1. A traditional neural-network model can only establish weight connections between layers, whereas the neurons within each RNN cell also establish weighted connections among themselves; this is the biggest difference between RNNs and traditional neural networks. As the sequence advances, earlier hidden states influence later ones, so RNNs outperform other neural-network models on temporal-sequence problems. However, RNNs cannot handle long-range dependencies and are highly susceptible to gradient vanishing and explosion, owing to their directed loop of information transfer [18]. LSTM addresses these shortcomings. First, LSTM can learn long-term dependencies. Second, the core idea of LSTM is to include three gating units within each recurrent unit, so that information at key nodes is selectively remembered or forgotten, which greatly alleviates the gradient vanishing and explosion that RNN models suffer on long time series [19]. The left half of Figure 2 shows the simple recurrent network (SRN) unit, and the right half shows the LSTM module used in the hidden layer of the RNN.
Taking (x_1, x_2, …, x_t) as the series of inputs to the model and setting the model's hidden states to (h_1, h_2, …, h_t), we have Equations (1)–(5) at moment t.
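Reconstructed in standard LSTM form, consistent with the symbol definitions that follow (the original typesetting of Equations (1)–(5) is not reproduced here, so this is a sketch assuming the usual gate structure):

$$ f_t = g\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{1} $$
$$ i_t = g\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{2} $$
$$ c_t = f_t \cdot c_{t-1} + i_t \cdot \delta\left(W_c \cdot [h_{t-1}, x_t] + b_c\right) \tag{3} $$
$$ o_t = g\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{4} $$
$$ h_t = o_t \cdot \delta\left(c_t\right) \tag{5} $$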
The f_t, i_t, o_t, and c_t in Equations (1)–(5) denote the forget gate, input gate, output gate, and cell state, respectively. W_f, W_i, W_c, and W_o denote the weight matrices of the corresponding gates, and b_f, b_i, b_c, and b_o denote their bias terms. The symbol · denotes the vector inner product, and g(·) and δ(·) denote the sigmoid and tanh functions, respectively.
A linear-regression layer is also added to meet the needs of prediction with LSTM; its expression is shown in Equation (6).
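A minimal sketch of Equation (6), assuming a standard affine regression layer with weight matrix W_y (the weight symbol is our notation):

$$ y_t = W_y \cdot h_t + b_y \tag{6} $$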
The b_y in Equation (6) denotes the threshold value of the linear-regression layer, and y_t denotes the result predicted by the model.
3. Sparrow Search Algorithm
The SSA is a novel intelligent-optimization algorithm recently introduced by Jiankai Xue, inspired by the predatory and anti-predatory behaviors of sparrows. The size of the sparrow population is denoted by N. The sparrow-set matrix is shown in Equations (7) and (8) below, where i = 1, 2, …, N and d is the dimension of the decision variable.
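Following the standard SSA presentation, Equations (7) and (8) can be reconstructed as the population matrix and an individual's position vector:

$$ X = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,d} \\ \vdots & \vdots & & \vdots \\ x_{N,1} & x_{N,2} & \cdots & x_{N,d} \end{bmatrix} \tag{7} $$
$$ X_i = \left( x_{i,1}, x_{i,2}, \ldots, x_{i,d} \right) \tag{8} $$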
The matrix of fitness values of these sparrows is shown in Equations (9) and (10), where N denotes the number of sparrows and each value in F_X denotes the fitness of an individual.
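Analogously, Equations (9) and (10) can be reconstructed as the fitness vector and an individual's fitness value:

$$ F_X = \begin{bmatrix} f\left(x_{1,1}, x_{1,2}, \ldots, x_{1,d}\right) \\ f\left(x_{2,1}, x_{2,2}, \ldots, x_{2,d}\right) \\ \vdots \\ f\left(x_{N,1}, x_{N,2}, \ldots, x_{N,d}\right) \end{bmatrix} \tag{9} $$
$$ F_{x_i} = f\left( x_{i,1}, x_{i,2}, \ldots, x_{i,d} \right) \tag{10} $$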
The sparrow with the better fitness value is the first to obtain food and acts as a discoverer, leading the entire population toward the food location. The discoverer's position is calculated according to Equation (11).
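In the standard SSA formulation, which the symbol definitions below match, the discoverer update reads:

$$ X_{i,j}^{t+1} = \begin{cases} X_{i,j}^{t} \cdot \exp\left( \dfrac{-i}{\alpha \cdot iter_{\max}} \right), & R_2 < ST \\[2ex] X_{i,j}^{t} + Q \cdot L, & R_2 \geq ST \end{cases} \tag{11} $$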
In Equation (11), t denotes the current iteration number, j = 1, 2, ⋯, d, and X_{i,j}^t denotes the position of sparrow i in the j-th dimension. iter_max denotes the maximum number of iterations, α is a random number within (0, 1), and R2 (R2 ∈ [0, 1]) and ST (ST ∈ [0.5, 1]) denote the alert value and the safety value, respectively. Q denotes a random number obeying the standard normal distribution N(0, 1), and L denotes a 1 × d matrix of ones. When R2 < ST, the vicinity is safe and the discoverer performs an extensive search. If R2 ≥ ST, the discoverer has detected danger and signals the entire population to move to another safe location.
The positions of the followers are calculated according to Equation (12).
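In the standard SSA formulation, the follower update reads as below, where X_P denotes the best position currently occupied by a discoverer (this symbol is not defined in the surrounding text and is our assumption):

$$ X_{i,j}^{t+1} = \begin{cases} Q \cdot \exp\left( \dfrac{X_{worst}^{t} - X_{i,j}^{t}}{i^{2}} \right), & i > 0.5N \\[2ex] X_{P}^{t+1} + \left| X_{i,j}^{t} - X_{P}^{t+1} \right| \cdot A^{+} \cdot L, & \text{otherwise} \end{cases} \tag{12} $$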
In Equation (12), X_worst represents the global worst position. A denotes a 1 × d matrix whose elements are randomly assigned the value 1 or −1, and A⁺ = Aᵀ(AAᵀ)⁻¹. When i > 0.5N, the i-th follower with poor fitness has not obtained food and has a low energy value, so it must forage elsewhere to replenish its energy.
When foraging begins, the population selects some sparrows to stand guard. When danger appears nearby, the discoverers and followers abandon the food they have found and flee to other locations. In each generation, 10–20% of the population, denoted SD, is arbitrarily selected for early warning. The position of these sparrows is calculated according to Equation (13).
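In the standard SSA formulation, the scout (early-warning) update reads:

$$ X_{i,j}^{t+1} = \begin{cases} X_{best}^{t} + \beta \cdot \left| X_{i,j}^{t} - X_{best}^{t} \right|, & f_i > f_g \\[2ex] X_{i,j}^{t} + k \cdot \left( \dfrac{\left| X_{i,j}^{t} - X_{worst}^{t} \right|}{f_i - f_w + \varepsilon} \right), & f_i = f_g \end{cases} \tag{13} $$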
In Equation (13), X_best represents the global best position. β is the step-size adjustment factor, a normally distributed random number with mean zero and variance one. k is a uniform random number within [−1, 1] that indicates the direction of the sparrow's movement and also acts as a step-size adjustment factor. f_i is the current sparrow's fitness value, and f_g and f_w are the current global best and worst fitness values, respectively. ε is a small constant that prevents the denominator from becoming zero. When f_i > f_g, the sparrow is at the edge of the population and is vulnerable to predators. When f_i = f_g, a sparrow in the middle of the population has sensed the danger and moves closer to the other sparrows.
4. Improved Sparrow-Search Algorithm
4.1. Sin-Chaos Population Initialization
Chaos is frequently applied to optimization-search problems. The Tent and Logistic models are the most commonly used chaotic models, but both are limited in the number of mapping folds, whereas the Sin chaotification model has an unrestricted number of folds. Haidong Yang et al. [20] demonstrated that the Sin model has better chaotic properties than the Logistic model, so this paper uses Sin chaos in the SSA. The one-dimensional self-mapping expression of Sin chaos is shown in Equation (14).
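A common form of the infinitely folded Sin map, consistent with the zero-initial-value caveat below (the paper's exact parameterization may differ), is:

$$ x_{n+1} = \sin\left( \frac{2}{x_n} \right), \quad n = 0, 1, \ldots, \quad x_n \in [-1, 1], \; x_n \neq 0 \tag{14} $$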
The initial value in Equation (14) cannot be set to zero, because a zero initial value produces fixed points and zeros within [−1, 1]. The relationships among initial-value sensitivity, ergodicity, randomness, and the number of iterations of the Sin-chaotic one-dimensional self-map are shown in Figure 3 [20]. Sub-graphs (a) and (b) of Figure 3 show that different initial values produce different chaotic sequences, and Figure 3c shows that after a certain number of update generations the system traverses the whole solution region.
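As an illustration, the following minimal Python sketch initializes a sparrow population with the Sin map of Equation (14); the map form, the scaling into [lb, ub], and all names are our assumptions:

```python
import numpy as np

def sin_chaos_init(pop_size, dim, lb, ub, x0=0.326, seed=None):
    """Generate an initial population from Sin-chaotic sequences.

    x0 must be nonzero, since Equation (14) is undefined at zero.
    """
    rng = np.random.default_rng(seed)
    pop = np.empty((pop_size, dim))
    x = x0
    for i in range(pop_size):
        for j in range(dim):
            x = np.sin(2.0 / x)          # Sin-map iterate, stays in [-1, 1]
            if x == 0.0:                 # guard against landing exactly on zero
                x = rng.uniform(0.1, 0.9)
            pop[i, j] = lb + (x + 1.0) / 2.0 * (ub - lb)  # [-1, 1] -> [lb, ub]
    return pop

# Example: 30 sparrows in a 5-dimensional search space bounded by [-10, 10]
population = sin_chaos_init(30, 5, -10.0, 10.0)
```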
4.2. Dynamic Adaptive Inertia Weights
If the discoverer approaches the optimal solution of the whole space at the beginning of the algorithm's iteration, the search accuracy will be low, and a search range that is too small easily traps the algorithm in a local extreme-value region. This paper therefore introduces the previous generation's global optimal solution into the discoverer-position update, so that the discoverer's position is influenced both by the previous generation of discoverer positions and by the previous generation's global optimum, which effectively prevents the algorithm from settling on a local extreme value as its best expectation. Furthermore, drawing on the concept of inertia weights, this paper adds a dynamic inertia-weight parameter w to the discoverer-position calculation [21]. Early in the iterative process, w is large, which strengthens global exploration; later in the iteration, w decreases adaptively, which strengthens the local search and accelerates convergence. w is calculated as shown in Equation (15), and the improved discoverer-location update is shown in Equation (16).
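The exact forms of Equations (15) and (16) are not reproduced above, so the following is an illustrative reconstruction consistent with the behavior just described: a weight that decays from w_max to w_min over the iterations, and a discoverer update pulled toward the previous generation's global best:

$$ w = w_{\max} - \left( w_{\max} - w_{\min} \right) \cdot \frac{t}{iter_{\max}} \tag{15} $$
$$ X_{i,j}^{t+1} = \begin{cases} w \cdot X_{i,j}^{t} + rand \cdot \left( X_{b,j}^{t-1} - X_{i,j}^{t} \right), & R_2 < ST \\[1ex] X_{i,j}^{t} + Q \cdot L, & R_2 \geq ST \end{cases} \tag{16} $$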
In Equation (16), X_{b,j}^{t−1} denotes the best solution of the entire space in the j-th dimension in the previous generation, and rand represents a random number between 0 and 1.
4.3. Improved Scout-Warning-Sparrow Update Formula
The formula for calculating the position of the scout-warning sparrow is improved as shown in Equation (17).
Equation (17) states that if the sparrow is not at the best location, it flies to a random point between itself and the best location; otherwise, its location is chosen randomly between the worst and the best locations.
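Written out from this description (an illustrative reconstruction, with rand a uniform random number in (0, 1)):

$$ X_{i}^{t+1} = \begin{cases} X_{i}^{t} + rand \cdot \left( X_{best}^{t} - X_{i}^{t} \right), & f_i \neq f_g \\[1ex] X_{best}^{t} + rand \cdot \left( X_{worst}^{t} - X_{best}^{t} \right), & f_i = f_g \end{cases} \tag{17} $$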
4.4. Incorporating Cauchy Variation and Opposition-Based Learning Strategies
Opposition-based learning (OBL) is a method proposed by Tizhoosh that finds the opposite solution corresponding to the current solution through an opposition-learning mechanism, then keeps the better of the two after evaluation and comparison. To enhance individuals' ability to find the best solution, this paper incorporates the OBL strategy into the SSA, characterized mathematically by Equations (18) and (19).
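An illustrative reconstruction of Equations (18) and (19), following the usual bound-based opposition construction and an information-exchange step controlled by b_1 (the exact forms in the paper may differ):

$$ \tilde{X}_{best}^{t} = lb + r \oplus \left( ub - X_{best}^{t} \right) \tag{18} $$
$$ X_{i}^{t+1} = X_{best}^{t} + b_1 \cdot \left( \tilde{X}_{best}^{t} - X_{best}^{t} \right) \tag{19} $$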
In Equation (18), X̃_best^t denotes the inverse solution derived from the t-th generation optimal solution, and r is a 1 × d matrix (d is the spatial dimension) of random numbers obeying the standard uniform distribution on (0, 1). ub and lb denote the upper and lower bounds, respectively. ⊕ represents the exclusive OR operation, and b1 is the control parameter of information exchange [22], which is calculated as shown in Equation (20).
The Cauchy variation is derived from the Cauchy distribution; Equation (21) gives the one-dimensional Cauchy probability-density expression.
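With scale parameter a > 0 and center zero, the density is:

$$ f(x) = \frac{1}{\pi} \cdot \frac{a}{a^{2} + x^{2}}, \quad -\infty < x < +\infty \tag{21} $$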
When a = 1, it is known as the standard Cauchy distribution. Figure 4 compares the probability-density curves of the Gaussian and Cauchy distributions.
As Figure 4 shows, the two tails of the Cauchy distribution are flat and long, approaching zero more gently and slowly than those of the Gaussian distribution, and its peak near the origin is lower. The Cauchy variation therefore has a stronger perturbation ability than the Gaussian variation. Hence, applying the Cauchy variation to the target-location calculation upgrades the global-search performance of the SSA by exploiting the perturbation ability of the Cauchy operator.
The location-update method is shown in Equation (22), where cauchy(0, 1) denotes the standard Cauchy distribution.
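A typical Cauchy-mutation update of the target (best) position, given as an illustrative reconstruction of Equation (22):

$$ X_{best}^{t+1} = X_{best}^{t} + X_{best}^{t} \cdot cauchy(0, 1) \tag{22} $$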
In addition, let η denote the generating function of Cauchy-distributed random variables; its expression is shown in Equation (23) below.
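By inverse-transform sampling from the standard Cauchy cumulative distribution function:

$$ \eta = \tan\left[ \left( \xi - \frac{1}{2} \right) \pi \right] \tag{23} $$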
In Equation (23), tan is the tangent function and ξ denotes a random number between 0 and 1.
To further improve the algorithm's optimization-seeking ability, this paper alternates the OBL strategy and the Cauchy variational-operator perturbation strategy with a certain probability, thereby dynamically updating the target location. The former obtains a reverse solution through the opposition-learning mechanism, enlarging the algorithm's search space; the latter derives a new solution by applying a perturbation-variation operation at the best-solution location, ameliorating the algorithm's inability to escape a local region. The target position is updated using the selection probability Ps [22] in Equation (24) as the basis for strategy selection.
In Equation (24), θ denotes the adjustment parameter, which is set to 1/20 in this paper.
If rand < Ps, the target position is updated using the OBL strategy of Equations (18)–(20); otherwise, the Cauchy variation-perturbation strategy of Equation (22) is selected.
Although the perturbation strategy above reduces the chance of the algorithm becoming trapped in a local region, it introduces a new problem: the fitness of the new location, obtained after the perturbation variation, is not necessarily better than that of the original location. Accordingly, this paper introduces a greedy rule, shown in Equation (25): the fitness of position x is denoted f(x), and whether to update the position information is decided by comparing the fitness values of the positions before and after the update.
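The greedy rule of Equation (25) can be written as follows (for minimization, with x^t the pre-perturbation position and x_new the perturbed candidate; notation reconstructed):

$$ x^{t+1} = \begin{cases} x_{new}, & f\left( x_{new} \right) < f\left( x^{t} \right) \\[1ex] x^{t}, & \text{otherwise} \end{cases} \tag{25} $$

As an illustration, the following minimal Python sketch combines the alternating OBL/Cauchy perturbation with this greedy rule; the function names, the b1 handling, and the minimization setting are our assumptions:

```python
import numpy as np

def perturb_and_select(x_best, fitness, lb, ub, p_s, b1, rng):
    """Perturb the best position with OBL or Cauchy mutation, then keep
    the better of the old and new positions (greedy rule, minimization)."""
    d = x_best.size
    if rng.random() < p_s:
        # OBL step: inverse solution built from the bounds (cf. Eq. (18)),
        # then information exchange controlled by b1 (cf. Eq. (19)).
        x_opp = lb + rng.random(d) * (ub - x_best)
        x_new = x_best + b1 * (x_opp - x_best)
    else:
        # Cauchy mutation: tan((xi - 0.5) * pi) samples cauchy(0, 1) (Eq. (23)).
        eta = np.tan((rng.random(d) - 0.5) * np.pi)
        x_new = x_best + x_best * eta
    x_new = np.clip(x_new, lb, ub)  # keep the candidate inside the search space
    # Greedy rule (Eq. (25)): accept the candidate only if it improves fitness.
    return x_new if fitness(x_new) < fitness(x_best) else x_best

# Example: one perturbation step on a quadratic test function
rng = np.random.default_rng(0)
f = lambda x: float(np.sum(x ** 2))
x = rng.uniform(-10.0, 10.0, size=5)
x = perturb_and_select(x, f, -10.0, 10.0, p_s=0.5, b1=0.8, rng=rng)
```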
7. Conclusions
This paper makes a series of improvements to the SSA that significantly improve its search performance and help it escape local optima. First, because sparrow-position initialization is crucial for global search, the population is initialized with Sin chaos, enriching the diversity of solutions. Second, a dynamic adaptive-weight factor is introduced, effectively balancing the algorithm's local and global exploitation abilities. Finally, the Cauchy variation and OBL strategies are integrated, reducing the probability of the algorithm being trapped in local extremes and thus enhancing its global exploration ability.
An ISSA fusing the Cauchy variation and OBL is proposed and used to optimize the LSTM hyperparameters in a combined-forecasting method, yielding an ISSA-LSTM-based STLF model. ISSA-LSTM reduces the influence of human factors on the LSTM and improves the model's ability to capture the characteristics of power-load data. The experimental results indicate that the proposed model achieves higher forecasting accuracy than the LSTM, SSA-LSTM, PSO-LSTM, PSO-BP, and PSO-LSSVM models, providing a new idea for STLF.
Future work could continue to improve the optimization mechanism and structure of the sparrow algorithm, or integrate the advantages of other intelligent algorithms to propose better-performing algorithms for load forecasting. In addition, we have read extensively in the literature of other fields [26,27,28,29,30,31] and may consider in the future how to incorporate those ideas into our research topics.