Article

Recovery Model of Electric Power Data Based on RCNN-BiGRU Network Optimized by an Accelerated Adaptive Differential Evolution Algorithm

1 Electric Power Research Institute, State Grid Shanghai Municipal Electric Power Company, Shanghai 200051, China
2 School of Electrical Engineering and Automation, Jiangsu Normal University, Xuzhou 221116, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(17), 2686; https://doi.org/10.3390/math12172686
Submission received: 8 July 2024 / Revised: 17 August 2024 / Accepted: 24 August 2024 / Published: 29 August 2024
(This article belongs to the Special Issue Analysis and Applications of Control Systems Theory)

Abstract
Time-of-use pricing of electric energy, as an important part of the national policy of energy conservation and emission reduction, requires accurate electric energy data as support. However, due to various reasons, the electric energy data are often missing. To address this thorny problem, this paper constructs a CNN and GRU-based recovery model (RCNN-BiGRU) for electric energy data by taking the missing data as the output and the historical data of the neighboring moments as the input. Firstly, a convolutional network with a residual structure is used to capture the local dependence and periodic patterns of the input data, and then a bidirectional GRU network utilizes the extracted potential features to model the temporal relationships of the data. Aiming at the difficult selection of network structure parameters and training process parameters, an accelerated adaptive differential evolution (AADE) algorithm is proposed to optimize the electrical energy data recovery model. The algorithm designs an accelerated mutation operator and at the same time adopts an adaptive strategy to set the two key parameters. A large amount of real grid data are selected as samples to train the network, and the comparison results verify that the proposed combined model outperforms the related CNN and GRU networks. The comparison experimental results with other optimization algorithms also show that the AADE algorithm proposed in this paper has better data recovery performance on the training set and significantly better performance on the test set.

1. Introduction

With the continuous growth of global energy consumption, energy conservation and emission reduction have become a common challenge for the world [1]. In order to alleviate the increasingly prominent contradiction between electricity supply and demand, China has implemented a comprehensive peak–valley time-sharing tariff system [2], which aims to encourage residents and enterprises to reduce electricity consumption during peak hours through different tariffs at different times of the day in order to adjust the load curve and improve the problem of imbalance in electricity consumption.
To ensure the fairness and reasonableness in billing, time-of-use tariffs place higher demands on the accuracy of electricity consumption data. However, during the data collection process, certain remotely collected metering data may be incorrect or lost due to communication errors, communication delays, equipment failures and other issues. This seriously affects the implementation of lean management practices in grid companies, and it also poses a risk to the reliability of data for power billing and customer service. Therefore, the recovery of missing electricity consumption data is an urgent problem.
Data recovery uses existing data to infer and fill in the missing data points. The methods can be divided into four main categories: imputation methods, interpolation methods, regression methods and machine learning methods.
  • The imputation method is one of the simplest missing data reconstruction methods. For numerical data, the mean, median or mode can be used to fill in the missing values. The method is simple and fast, but it may ignore the inter-sample variability. At the same time, from the point of view of electricity users, the filled electric energy data lack the necessary basis and rationality, especially when the electric load changes frequently.
  • Interpolation methods estimate the values of missing data points based on the relationships among existing data points. Traditional methods include linear interpolation, polynomial interpolation, spline interpolation and so on. Although these methods can retain the trend and change characteristics of the data to a certain extent, the interpolation function is too simple to fit complex data curves. Therefore, many scholars have proposed enhanced interpolation methods tailored to different data to better fit their change characteristics. Kim [3] extracts spatial direction vectors from the edge information of neighboring image data and interpolates the lost pixels on a pixel-by-pixel basis to recover image block data corrupted by transmission loss. Deng [4] proposed an enhanced random forest algorithm by integrating methods like linear interpolation, matrix combination, and matrix permutation tailored to address the challenge of missing data recovery within vast quantities of power data. Ding [5] employed a range of interpolation techniques, including radial basis functions, moving least squares, and adaptive inverse distance, for the purpose of recovering missing values in small-scale time-series data; nonetheless, its accuracy is heavily reliant on the availability of an adequate number of reference samples. Mo [6] proposed a novel self-supervised interpolation via frequency extrapolation algorithm for regularly missing seismic data: aliasing-free low-frequency complete data are reconstructed via the Nyquist sampling theorem, and high-frequency data are recovered via self-supervised frequency extrapolation.
  • Regression methods use regression models to predict the values of missing data points, which are built upon the characteristics and labeling information of existing data. In addition to traditional methods such as linear regression, ridge regression and random forest regression, scholars have designed new regression models to accommodate data sets with different characteristics. Yang [7] proposed a novel non-convex function called arc-tangent function to construct the inversion model to overcome the noisy data interpolation problem and improve the stability of non-convex regularization. Then, a prediction–projection method is used to solve the proposed models. Pan [8] introduced a linear regression model to describe the spatial relationship between sensor data collected from different sensor nodes. By incorporating information from neighboring nodes, it estimates missing data points and ultimately yields consistent and dependable data-fitting outcomes. Paik [9] optimizes the output data via convex optimization-based sparse recovery to estimate the direction-of-departure/direction-of-arrival for each target accurately. Zhong [10] proposed a piecewise sparse approximation model and a piecewise proximal gradient method to approximate piecewise signals.
  • With the increase in computing power in recent years, more and more machine learning methods have been used for data recovery, and deep learning in particular is emerging as the modeling approach of choice. James [11] capitalized on the grid’s topology, leveraging pre-existing measurement data, and devised a graph convolutional recursive adversarial network to process available information and uncover correlations within graph and time data in order to forecast and restore the power grid state. Chai [12] devised a model architecture utilizing the U-Net convolutional neural network; this model takes randomly sampled data as its input and produces corresponding complete data as its output. Wu [13] added five activation layers in the U-Net model to replace the original sigmoid layer and connected them with the decoder by five side branches to output multi-scale features for the interpolation of seismic data. As a powerful tool for extracting features from data, deep learning (DL) can theoretically avoid the assumptions that limit traditional interpolation methods (such as linearity), but it requires significant computational resources.
Among the four types of data recovery algorithms, the imputation method is simple and fast, but it ignores the differences between samples. From the perspective of electricity consumers, the imputed energy data lack the necessary basis and rationality, especially when the electricity load changes frequently. The interpolation method can retain the trend and variation characteristics of electricity consumption data to some extent, but due to the simplicity of the interpolation function, it cannot fit the complex variation curves of electricity consumption data. There are a large number of electricity consumers in the power grid with significant differences in electricity usage characteristics, making it difficult to construct a suitable regression model applicable to each consumer. Compared to regression methods, deep learning methods theoretically avoid the linear constraints of data fitting and can fit any functional relationship. Therefore, the deep learning approach is an excellent choice to construct a specific neural network model, which is trained to model the recovery of missing data using the existing electricity consumption data.
It is well known that for a fixed data set, the effect of deep learning depends largely on the structure of the neural network and the training process. Under certain conditions, the deeper the network layers and the more nodes it has, the more complex the function that the neural network can characterize, the deeper it can dig into the internal features of the data, and the stronger the fitting ability of the neural network. However, this is accompanied by a rapid increase in the difficulty of network training along with a significant rise in computational power requirements and costs.
Although building complex large-scale neural network models to model the electricity consumption of individual customers can recover missing data more accurately, it may be unacceptable for grid managers. The reasons are summarized as follows:
  • The number of grid customers with missing data may be hundreds or even more, and it is nearly impossible to train a large model for each customer that consumes electricity.
  • The characteristics of the customer’s electricity consumption may change considerably as time goes on, and the model needs to be retrained accordingly. The trained large model is less time-efficient.
  • The improvement in data recovery accuracy may not be significant compared to the increased cost of a large model, making the cost-effectiveness of the large model low.
  • From a practical metering perspective, overly high accuracy is unnecessary.
At the same time, due to the relatively homogeneous types of customer electricity consumption characteristics, small-scale neural network models are also competent in modeling electricity consumption. Compared to large models, small models can drastically reduce the cost of data recovery while ensuring a certain level of accuracy, which is what grid managers want to see most. Therefore, constructing small-scale neural network models with better performance becomes the key to electricity consumption data recovery.
The electricity information collection system periodically collects electricity consumption data, forming a time series. If the collection fails, missing data appear in the time series. In this paper, the missing data serve as the output, and historical data from adjacent time periods are used as the input to construct the corresponding time-series model.
As one of the earliest neural networks used for time-series modeling, a recurrent neural network (RNN) introduces recurrent connections to continuously transfer information within the network. The network parameters are shared across different time steps, enabling the network to remember previous information and handle sequences of arbitrary length, better managing the temporal relationships in sequential data. However, since the information in long sequences gradually fades away as gradients propagate through time, it is difficult for RNNs to capture long-term dependencies, and the network has difficulty remembering information farther back along the timeline. The vanishing gradient leads to declining performance on long sequences and increases the computational complexity, making RNNs unsuitable for large-scale data and deep networks.
Long short-term memory (LSTM) [14,15] is based on the RNN and introduces input gates, forget gates and output gates to control how the network forgets old information and incorporates new information. The gating mechanism inherits the time-series processing advantage of the RNN while alleviating the gradient vanishing and explosion problems of the RNN.
A gated recurrent unit (GRU) [16,17], as a simplified version of LSTM, is able to capture the long-term dependencies of a time series, which effectively shortens the training time of the model while guaranteeing high prediction accuracy. The bidirectional structure can further enhance this ability, making it more suitable for electric load forecasting. Although a GRU can consider the historical patterns of temporal data, the feature relationships need to be constructed manually, and it cannot fully explore the discontinuous features in high-dimensional space. Therefore, combining it with other networks can enhance the mining of load features.
A convolutional neural network (CNN) [18,19] has superior performance in feature extraction and can efficiently process data of different dimensions and extract potential feature information. Therefore, a CNN is introduced into the GRU model for local feature information extraction, while the GRU can further learn higher-level abstract representations on these features. The combination of those two networks allows the model to obtain information from a global perspective as well as capture local details, which improves the model’s ability to process sequence data and brings better feature extraction and modeling results.
In addition to the network structure parameters, the performance of neural networks is also largely determined by the training process. The RCNN-BiGRU model proposed in this paper is similar to other neural networks in that the training parameters are difficult to determine and usually need to be selected based on human experience. Different training parameters yield varying fitting capabilities, training speeds, and prediction outcomes. Another challenge facing the authors is how to find the optimal combination of network architecture and training parameters as quickly as possible under limited computational resources. This is essentially an optimization problem, where the structural and training parameters are the independent variables, and the fitting performance of the trained neural network model serves as the objective function.
To solve the above optimization problem, heuristic global optimization algorithms are used in this paper. Commonly used methods include the genetic algorithm (GA), particle swarm optimization (PSO), and differential evolution (DE). The GA simulates the genetic evolution process with binary coding and has relatively weak optimization performance for continuous variables. The PSO algorithm simulates the foraging behavior of bird flocks to update velocity and position, but it has a slower convergence speed. The DE algorithm, with its simple principle, few control parameters and fast convergence, has proven to be a simple yet efficient heuristic global optimization algorithm.
In the annual CEC single-objective parameter optimization competitions, DE has consistently ranked in the top three, often securing the championship, while PSO and the GA are rarely seen. Compared with algorithms such as PSO and the GA, DE demonstrates a significant performance advantage. To find the optimal combination of network architecture and training parameters, the solution to the aforementioned optimization problem requires training the network model, which incurs corresponding computational costs and economic expenses. This places higher demands on the speed of the optimization algorithm.
As a significant advancement in the history of evolutionary algorithms, the differential evolution (DE) algorithm [20] has been widely applied across various fields due to its understandable, simple, and effective operations, demonstrating excellent convergence and robustness [21,22]. To meet the high demand for search speed, this paper proposes an accelerated mutation operator to speed up the search, and two new adaptive adjustment methods are designed to address the difficult selection of the DE control parameters.
Since the network training parameters include real-valued variables and network optimization requires high speed, GA and PSO algorithms find it difficult to meet these needs. Therefore, this paper proposes an accelerated adaptive differential evolution (AADE) algorithm to find the optimal combination of network architecture and training parameters to enhance the accuracy and stability of recovered data. This improved algorithm addresses the high-speed search requirement by introducing an accelerated mutation operator to speed up the search process. Additionally, two new adaptive adjustment methods are designed to alleviate the difficulties in the selection of control parameters in the AADE algorithm.
To validate the effectiveness of the method proposed in this paper, actual electricity consumption data are selected as samples to train the relevant neural network models. Firstly, the network structure parameters and training parameters are fixed to compare the advantages and disadvantages of several different network structures. The results indicate that the RCNN-BiGRU combined network proposed in this paper performs better than the other network structures. Next, the Taguchi design method is used to identify the general range of network architecture parameters and training parameters for the potential optimal network, which serves as the value range of the independent variables in the optimization problem. Finally, the AADE algorithm is used to solve the optimization problem to find the best combination of network structure parameters and training parameters. The superiority of the AADE algorithm is verified through comparative experiments.
The rest of this paper is organized as follows. In Section 2, the RCNN-BiGRU prediction model is presented along with a description of each network layer. The accelerated adaptive differential evolution (AADE) algorithm for optimizing the RCNN-BiGRU is introduced in Section 3. Section 4 describes the flowchart of the power data recovery method in detail. Numerical simulations and analysis are presented in Section 5 to verify the advantages and efficiency of the proposed data recovery method. Finally, Section 6 concludes the whole paper and points out several topics to be further researched in the future.

2. Construction of the RCNN-BiGRU Prediction Model

In order to find the relationship between the missing data and those from the neighboring moments in the time series, a combined RCNN-BiGRU neural network prediction model is constructed. This model primarily consists of a residual CNN (RCNN) network and a bidirectional GRU (BiGRU) network. The RCNN is mainly used to extract local feature information from the time series, while the BiGRU further learns high-level abstract representations of these features. This section first provides a detailed explanation of the principles of the main components in the model and then describes the roles and connections of each network layer in the RCNN-BiGRU combined neural network prediction model.

2.1. Basic Principles of BiGRU

To address the high computational complexity and large training data requirements of LSTM, a gated recurrent unit (GRU) simplifies its structure. The specific structure of a GRU network is shown in Figure 1 below.
Compared to the three gating mechanisms of LSTM, a GRU has only two key gating units: the reset gate and the update gate. The reset gate allows the model to decide whether or not to ignore the hidden state from the previous time step, helping to eliminate unnecessary historical information and preventing the model from overly relying on distant past states. Conversely, the update gate controls the proportion of the current hidden state that comes from the hidden state of the previous time step and the proportion that comes from the current input. This dynamic adjustment enables a GRU to more flexibly capture dependencies across different time scales. Additionally, a GRU has no separate memory cell and uses a single hidden state as the memory of the process. The current hidden state is computed by combining the current input with the hidden state from the previous time step. It makes a GRU simpler than LSTM. These factors simplify the model structure with fewer parameters and reduce the computational complexity, resulting in faster training speeds for a GRU.
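For reference, the standard GRU update equations consistent with this description can be written as follows (this is the common textbook formulation, stated here for completeness rather than reproduced from Figure 1):

$$
\begin{aligned}
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)}\\
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)}\\
\tilde{h}_t &= \tanh\!\left(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\right) && \text{(candidate state)}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(hidden state)}
\end{aligned}
$$

where $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication.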
The bidirectional gated recurrent unit, abbreviated as BiGRU, as shown in Figure 2, can be regarded as an extended version of a GRU, consisting of two GRU networks operating in opposite directions. One network processes the time-series data from the beginning to the end, while the other processes it from the end to the beginning. Both of them produce an output, and a BiGRU combines these two outputs to obtain the final result. This bidirectional structure for processing information can simultaneously capture information from both past and future and adaptively learn long-term dependencies and interactions among multiple variables in time-series data. This leads to a more comprehensive modeling of temporal relationships in time-series data.

2.2. Basic Principles of Residual CNN

CNNs were initially used in image processing, and their powerful feature extraction capabilities allow them to be effectively applied to time-series data as well. Convolutional operations as the core of CNNs are conducted through the sliding window of filters (also known as convolution kernels) on the input data to extract the local patterns and their features. In time-series prediction, CNNs can help the model capture the local dependencies and periodic patterns of the data as well as provide powerful feature representations for subsequent predictions.
Adding CNN layers to the neural network through local connections and weight sharing not only reduces the number of model parameters and the computational complexity but also retains the spatial structure information of the data. The deeper the network, the wider the solution space that can be covered, and theoretically, the higher the achievable accuracy. However, simply stacking more layers does not directly lead to better convergence and accuracy. On the contrary, as the number of network layers increases, the network degrades: the training error becomes larger and larger, and the model becomes more and more difficult to train.
To address this problem, He et al. [23] proposed the concept of residual networks, which helps to solve the problem of gradient vanishing and gradient explosion, while training deeper networks with good performance. The design of the residual network shown in Figure 3 mainly includes shortcut connections and identity mapping. The shortcut connection makes the residual possible, while the identity mapping makes the network deeper. There are two main identity mappings: skip connection and activation function.
The residual structure [24] changes the training objective from fitting an underlying mapping $H(x)$ directly to fitting the residual $H(x) - x$ and driving it toward 0, which solves the problem that very deep networks are difficult to train. The learning objective becomes an identity mapping, that is, making the output $H(x)$ approximate the input $x$. Since pushing the residual toward 0 is easier than approximating an arbitrary function, adding such layers does not cause a decrease in accuracy in the subsequent layers.
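In the standard notation of [23], a residual block with input $x$ therefore learns the residual function rather than the full mapping:

$$y = \mathcal{F}(x, \{W_i\}) + x, \qquad \mathcal{F}(x) = H(x) - x,$$

where $\mathcal{F}$ is realized by the stacked weight layers and the shortcut connection carries $x$ unchanged.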
The residual structure is widely used because it maintains the same network size and number of channels while adding only a pairwise additive operation, which is extremely computationally inexpensive but improves the performance of the network substantially.

2.3. Construction of the RCNN-BiGRU Combined Model

To predict the missed electric grid load data, this paper constructs a RCNN-BiGRU composite network model as shown in Figure 4.
The combined model consists of the following layers:
  • Input Layer: The starting point where input sequences are fed into the network.
  • Convolutional Layer: It is utilized to extract local features of the input sequence and capture critical patterns within them. Different sizes of convolutional kernels can be combined to capture features of various scales. Here, 1D convolutional layers with kernel widths of 3, 3, and 1 are used, each with 16 filters.
  • Batch Normalization Layer: By normalizing the input of each batch, the input values of the activation function are within a smaller range, and the convergence speed is increased, thus accelerating the training process of the neural network.
  • Relu Layer: With the simple nonlinear transformation, the ReLU layer is widely used in modern deep learning models. Owing to its ability to avoid the vanishing gradient problem, high computational efficiency and sparse activation characteristics, the ReLU layer significantly enhances the training effectiveness and performance of these models.
  • GRU Layer: Extracts features from the input sequence.
  • Flip Layer: Flips the input data along the time dimension so that a backward GRU network can be constructed.
  • Addition Layer: Performs element-wise addition, summing multiple input tensors to produce a single output tensor. The addition layer can merge information from different network branches to realize residual connections.
  • Concatenation Layer: Achieves the fusion of multiple features and multi-scale feature integration by concatenating tensors along a specific dimension, improving the network’s representation capability and performance.
  • Fully Connected Layer: Provide the final stage of feature processing and mapping to output regression targets. Each neuron in this layer is connected to every neuron in the previous layer.
  • Regression Layer: As the final layer of the network, it processes the context vector and produces the final prediction for the sequence.
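To make the layer list concrete, the following is a minimal PyTorch sketch of this layout. The authors' implementation is in MATLAB; the layer sizes follow the text (16 filters, kernel widths 3/3/1, a bidirectional GRU), while the exact residual wiring and the single-output regression head are illustrative assumptions rather than the published model.

```python
import torch
import torch.nn as nn

class RCNNBiGRU(nn.Module):
    def __init__(self, num_gru: int = 20, num_filters: int = 16):
        super().__init__()
        # Residual CNN branch: two 3-wide convolutions plus a 1-wide shortcut,
        # each followed by batch normalization; outputs are added element-wise.
        self.conv1 = nn.Conv1d(1, num_filters, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(num_filters, num_filters, kernel_size=3, padding=1)
        self.shortcut = nn.Conv1d(1, num_filters, kernel_size=1)
        self.bn1 = nn.BatchNorm1d(num_filters)
        self.bn2 = nn.BatchNorm1d(num_filters)
        self.relu = nn.ReLU()
        # A bidirectional GRU plays the role of the forward GRU + flip-layer pair.
        self.bigru = nn.GRU(input_size=num_filters, hidden_size=num_gru,
                            batch_first=True, bidirectional=True)
        # Fully connected + regression output (a single recovered load value).
        self.fc = nn.Linear(2 * num_gru, 1)

    def forward(self, x):
        # x: (batch, 25) historical points from neighboring moments
        x = x.unsqueeze(1)                      # -> (batch, 1, 25)
        r = self.relu(self.bn1(self.conv1(x)))
        r = self.bn2(self.conv2(r))
        x = self.relu(r + self.shortcut(x))     # residual addition layer
        x = x.transpose(1, 2)                   # -> (batch, 25, filters) for the GRU
        out, _ = self.bigru(x)
        return self.fc(out[:, -1, :])           # last time step -> regression layer

model = RCNNBiGRU()
print(model(torch.randn(8, 25)).shape)          # torch.Size([8, 1])
```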

3. Model Optimization Methodology: Accelerated Adaptive Differential Evolution Algorithm

After the RCNN-BiGRU model is constructed, it needs to be trained based on sample data. Similar to other neural networks, the structural parameters and training parameters have a large impact on the model performance. The challenge addressed by the proposed accelerated adaptive differential evolution (AADE) algorithm is how to use less computational power to quickly find the optimal parameter combination. During the optimization process, network training is required, which places higher demands on the speed of the optimization algorithm. For this reason, the improved algorithm uses an accelerated mutation operator to speed up the search and reduce computational consumption. At the same time, to alleviate the difficulties in selecting the control parameters of the AADE algorithm, two new adaptive adjustment methods are designed to improve the accuracy and stability of the electricity consumption prediction.
This section first introduces the main process of the classical differential evolutionary algorithm, including the necessary notation and terminology. It then provides a detailed explanation of the accelerated mutation operator and two adaptive parameter setting schemes, and finally, it presents the whole workflow of the AADE algorithm proposed in this paper.

3.1. Classical Differential Evolution Algorithm

The differential evolution algorithm has few control parameters and fast convergence speed. It mainly includes three operations with simple principles: mutation, crossover, and selection. The specific process is described as follows [25]:
(1)
Initialization
NP real-valued vectors, $x_{i,g} = (x_{i,1,g}, x_{i,2,g}, \ldots, x_{i,D,g})$, $i = 1, 2, \ldots, NP$, serve as a population at each generation g in DE, where NP is the number of members in the population and D represents the problem dimension. The initial population should cover the whole search space, and thus, it is usually chosen randomly within the boundary constraints with a uniform distribution. Taking the j-th component of the i-th vector as an example, the value can be generated according to

$$x_{i,j} = x_{j,min} + r \cdot (x_{j,max} - x_{j,min}), \quad j = 1, 2, \ldots, D$$

where r is a uniform random number between 0 and 1 generated for each j and each i independently. $x_{j,max}$ and $x_{j,min}$ are the upper and lower boundaries for the j-th component of the vector x, respectively. After initialization, DE enters a loop of three evolutionary operations (mutation, crossover, and selection) until some specific terminating conditions are satisfied. The crucial idea behind DE is a scheme for generating trial vectors. Mutation and crossover are employed to generate a trial vector for each target vector (or parent vector) $x_{i,g}$, and then selection determines which of the vectors survive into the next generation.
(2)
Mutation
DE produces a mutation vector $v_{i,g}$ through a differential strategy. An arithmetic operator depending on the differences between randomly selected pairs of individuals replaces the mutation operator realized upon some probability distribution function. The most commonly used differential strategy is to select two different individuals and add the scaled difference vector $(x_{r_1,g} - x_{r_2,g})$ to a third individual $x_{r_0,g}$. The amplification of the difference vector is controlled by a real mutation factor F. The mutation can be carried out as follows:

$$v_{i,g} = x_{r_0,g} + F \cdot (x_{r_1,g} - x_{r_2,g})$$

where $r_0$, $r_1$ and $r_2$ are three distinct integers uniformly chosen from the set $\{1, 2, \ldots, NP\}$ except i. It should be noted that if any component of the mutation vector violates its predefined boundary constraint, this component is set to its corresponding boundary value.
(3)
Crossover
As a complement to mutation, the mutation vector $v_{i,g}$ is mixed with its corresponding target vector $x_{i,g}$ to construct the final trial offspring vector, $u_{i,g} = (u_{i,1,g}, u_{i,2,g}, \ldots, u_{i,D,g})$. The binomial crossover operation after mutation can be accomplished as shown in Equation (3).

$$u_{i,j,g} = \begin{cases} v_{i,j,g}, & j = j_{rand} \text{ or } r < CR \\ x_{i,j,g}, & \text{otherwise} \end{cases} \qquad (3)$$

where $j_{rand}$ is a random integer in $\{1, 2, \ldots, D\}$ to ensure that the trial vector $u_{i,g}$ differs from its corresponding target vector $x_{i,g}$ in at least one dimension. Otherwise, no new vectors would be produced, and the population would not change. CR is the crossover probability and roughly corresponds to the average proportion of components in the trial vectors that are inherited from the mutation vectors.
(4)
Selection
The greedy scheme shown in Equation (4) is utilized in the selection between the trial vector $u_{i,g}$ and its corresponding target vector $x_{i,g}$. The vectors with better fitness are more favorable and will become members of the population in the next generation.

$$x_{i,g+1} = \begin{cases} u_{i,g}, & f(u_{i,g}) \le f(x_{i,g}) \\ x_{i,g}, & \text{otherwise} \end{cases} \qquad (4)$$

If, and only if, the trial vector $u_{i,g}$ yields a better fitness than its corresponding target vector $x_{i,g}$, the old target vector is replaced by the new trial vector. Otherwise, the target vector is retained as a member of the next generation. The one-to-one greedy selection procedure here is generally fixed across different DE variants; meanwhile, there have been many variants of the mutation and crossover operations other than those described above.
(5)
Termination
Evolution is terminated if either of the following two criteria is met: the current generation exceeds the maximum generation, or the fitness precision of the best individual satisfies the requirements determined in advance. If not, the mutation, crossover and selection operations continue to be executed on the population.
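The whole loop can be summarized in a few lines. The sketch below is a minimal NumPy rendering of the classical "DE/rand/1/bin" procedure described above; the sphere objective, population size and F/CR values are illustrative choices only, not values used in the paper.

```python
import numpy as np

def classical_de(f, x_min, x_max, NP=30, F=0.5, CR=0.9, max_gen=200):
    D = len(x_min)
    rng = np.random.default_rng(0)
    # Initialization: uniform random vectors inside the box constraints (Eq. 1)
    pop = x_min + rng.random((NP, D)) * (x_max - x_min)
    fit = np.array([f(x) for x in pop])
    for g in range(max_gen):
        for i in range(NP):
            # Mutation (Eq. 2): three distinct indices, all different from i
            r0, r1, r2 = rng.choice([k for k in range(NP) if k != i], 3, replace=False)
            v = pop[r0] + F * (pop[r1] - pop[r2])
            v = np.clip(v, x_min, x_max)        # reset violated components to the bound
            # Binomial crossover (Eq. 3)
            j_rand = rng.integers(D)
            mask = (rng.random(D) < CR)
            mask[j_rand] = True
            u = np.where(mask, v, pop[i])
            # Greedy selection (Eq. 4): keep the better of target and trial
            fu = f(u)
            if fu <= fit[i]:
                pop[i], fit[i] = u, fu
    return pop[np.argmin(fit)], fit.min()

best_x, best_f = classical_de(lambda x: np.sum(x**2), np.full(5, -5.0), np.full(5, 5.0))
print(best_x, best_f)
```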
Based on this understanding of the classical differential evolution algorithm, the differences between it and the AADE algorithm proposed in this paper are described in detail in the following sections. The AADE algorithm introduces an accelerated mutation strategy to speed up the search without reducing diversity, and it proposes two new adaptive methods to dynamically adjust F and CR, adapting to different fitness landscapes and simplifying parameter settings.

3.2. Accelerated Mutation Operator

DE produces the mutation vectors on the basis of differences between randomly selected pairs of individuals rather than some probability distribution function. The final trial offspring vectors are combinations of the target vectors and their corresponding mutation vectors using a binomial or exponential crossover operation. Therefore, the mutation operation is the core of DE, and much previous research has focused on the generation strategies of the mutation vectors.
The original DE employed the “DE/rand/1” strategy to obtain new mutation vectors: the difference vector between two distinct individuals selected from the population is multiplied by the mutation factor F and added to a third distinct individual. F is a real number that controls the amplification of the difference vector. Since the classic “DE/rand/1” strategy was introduced, many variants employing different learning strategies and recombination operations have been presented. In order to classify these variants, Storn [26] introduced a general convention for naming mutation strategies. The notation “DE/x/y/z” is used, where DE stands for differential evolution, x represents the basis vector to be mutated, y denotes the number of difference vectors considered, and z specifies the crossover scheme being used. The most widely used DE mutation operators are listed below: “DE/rand/1”, “DE/rand/2”, “DE/best/1”, “DE/best/2” and “DE/current-to-best/1”.
(1)
“DE/rand/1” and “DE/rand/2”
$$v_{i,g} = x_{r_0,g} + F \cdot (x_{r_1,g} - x_{r_2,g});$$
$$v_{i,g} = x_{r_0,g} + F \cdot (x_{r_1,g} - x_{r_2,g}) + F \cdot (x_{r_3,g} - x_{r_4,g})$$
(2)
“DE/best/1” and “DE/best/2”
$$v_{i,g} = x_{best,g} + F \cdot (x_{r_1,g} - x_{r_2,g});$$
$$v_{i,g} = x_{best,g} + F \cdot (x_{r_1,g} - x_{r_2,g}) + F \cdot (x_{r_3,g} - x_{r_4,g})$$
(3)
“DE/current-to-best/1”
$$v_{i,g} = x_{i,g} + F \cdot (x_{best,g} - x_{i,g}) + F \cdot (x_{r_1,g} - x_{r_2,g})$$

where $r_0$, $r_1$, $r_2$, $r_3$ and $r_4$ are five distinct integers uniformly chosen from the set $\{1, 2, \ldots, NP\}$ except i. $x_{best,g}$ represents the best individual among the population in the current generation.
As shown above, the mutation operator consists of two parts: the basis vector and the difference vector. From the searching perspective, the basis vector and the difference vector can be viewed as the search center and radius, and DE indeed carries out a local search around the basis vectors. The selection of basis vectors plays the most important role in DE and determines the capacities of exploration and exploitation. Since the basis vectors are selected randomly from the current population in “DE/rand/1” and “DE/rand/2”, the search proceeds over the whole variable space, which results in stronger exploration capacity but slower convergence. On the contrary, searching around the best individuals in “DE/best/1” and “DE/best/2” leads to stronger exploitation capacity and higher precision. However, because of the poor diversity of the basis vectors in “DE/best/1” and “DE/best/2”, the population tends to be trapped near the local optima and seldom escapes from those inferior areas. Unlike the above four schemes, “DE/current-to-best/1” chooses unknown positions between the current individuals and the best individual as the search center. Better solutions are more likely to be found around the existing individuals in the current population, whether the best one or not, than in unknown areas. Moreover, as in “DE/best/1” and “DE/best/2”, the population with “DE/current-to-best/1” is apt to stagnate in the regions where the local optima are located because all individuals approach the best individual in each generation.
Based on the above analysis, a new selection method for the basis vectors is proposed to improve the performance of DE. For each target individual, a small subpopulation of $p_s \cdot NP$ individuals is constructed randomly from the current population, and the local best individual among them is identified. The current target individual and its corresponding local best individual are compared with respect to fitness, and the better one serves as the basis vector. In this novel accelerated mutation strategy, a mutation vector is generated in the following manner:
$$v_{i,g} = \begin{cases} x_{i,g} + F \cdot (x_{r_1,g} - x_{r_2,g}), & f(x_{i,g}) \le f(x_{lb,g}) \\ x_{lb,g} + F \cdot (x_{r_1,g} - x_{r_2,g}), & f(x_{i,g}) > f(x_{lb,g}) \end{cases}$$
where $x_{lb,g}$ represents the local best individual among the corresponding subpopulation for the current target individual $x_{i,g}$.
Within this new scheme, the better an individual is, the more likely the search proceeds around it. It should be noted that the worst $p_s \cdot NP$ individuals in each generation will not be selected as basis vectors. Avoiding search in those unpromising areas accelerates the search over the whole variable space; intuitively, better solutions are more likely to be found around better individuals than around worse ones. Moreover, population diversity is better maintained because all individuals except the worst $p_s \cdot NP$ ones in each generation can be chosen as basis vectors with biased probabilities.
Compared to “DE/rand/1” and “DE/rand/2”, the new accelerated mutation strategy has stronger exploitation capacity and higher search efficiency, finding better solutions with high precision. From another aspect, the new accelerated mutation strategy maintains better exploration capacity than “DE/best/1” and “DE/best/2”, preventing the population from being trapped in the areas around the local optima. It can be regarded as a generalization of “DE/rand/1” and “DE/best/1”. The balance between exploration and exploitation depends on the size of the subpopulation ($p_s \cdot NP$) for each target individual: larger $p_s$ values lead to a larger probability of local search, and smaller values favor global search. When $p_s$ is equal to 1, the new accelerated mutation strategy is the same as “DE/best/1”. Since the new accelerated mutation strategy drives a local search around the existing individuals, the subpopulation size must be small enough to guarantee an adequate scope for the global search. To further improve performance, $p_s$ is defined as a real number randomly chosen from the interval [0.05, 0.2], different for each target individual in this paper.
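As a sketch, the operator can be written as follows. Minimization of f is assumed, the helper works on a NumPy population array, and the subpopulation bookkeeping is an illustrative rendering of the description above rather than the authors' code.

```python
import numpy as np

def accelerated_mutation(pop, fit, i, F, rng):
    NP, D = pop.shape
    # Subpopulation fraction ps drawn anew for each target individual
    ps = rng.uniform(0.05, 0.2)
    sub = rng.choice(NP, size=max(2, int(round(ps * NP))), replace=False)
    lb = sub[np.argmin(fit[sub])]               # local best of the subpopulation
    basis = pop[i] if fit[i] <= fit[lb] else pop[lb]
    r1, r2 = rng.choice([k for k in range(NP) if k != i], 2, replace=False)
    return basis + F * (pop[r1] - pop[r2])
```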

3.3. Two Adaptive Parameter Setting Schemes

There are several parameters to be set in DE, including the number of members in the population NP, the mutation factor F, the crossover probability CR, the total number of generations and so on. Choosing suitable control parameter values is frequently a problem-dependent task because suitable control parameters differ across problems. Among these parameters, F and CR are the two most important ones, which determine the generation of the trial vectors during the evolution. Generally, most parameter control techniques are based on the self-adaptation of these two parameters (F and CR) as the evolution proceeds.
The main goal here is to produce a DE that is flexible in terms of the control parameters F and CR. There is no consistent methodology or uniform standard for determining the F and CR values. The trial-and-error method used for tuning the control parameters requires multiple runs and relies on prior knowledge and experience, and the best control parameter settings for DE are problem dependent. Various adaptive mechanisms have been applied to update the control parameters dynamically without a user’s prior knowledge of the relationship between the parameter setting and the problem characteristics, and they show faster and more reliable convergence performance than the classic DE algorithm. In addition, a well-designed parameter adaptation mechanism is capable of improving the robustness and convergence performance. It is therefore crucial for DE to find the most suitable parameter control scheme.
As analyzed in Section 3.2, most mutation strategies including the new accelerated mutation strategy drive a local search around the existing individuals in the population, and the search radius relies on the amplification of the difference vector with the mutation factor F. In other words, not only the mutation factor F but also the difference vector determines the search, and it does not mean that larger F values lead to a larger search radius. However, all previous researchers ignore this important point, and previous self-adaptive mechanisms are designed without considering the effect of the difference vectors. A novel approach shown in Equation (6) is proposed to adjust the mutation factor F relying on the difference vectors.
$$F_i = \max(F_i, F_{min}); \quad F_i = F_{min} + (F_{max} - F_{min}) \cdot \left(1 - 2 \cdot \frac{\sum_{j=1}^{D}(x_{r_1,j,g} - x_{r_2,j,g})^2}{\sum_{j=1}^{D}(x_{max,j,g} - x_{min,j,g})^2}\right); \qquad (6)$$
$$x_{max,j,g} = \max_i(x_{i,j,g}); \quad x_{min,j,g} = \min_i(x_{i,j,g}); \quad i = 1, 2, \ldots, NP, \; j = 1, 2, \ldots, D$$
where $F_i$ represents the mutation factor F for the i-th mutation vector. $F_{max}$ and $F_{min}$ are the maximal and minimal values for F, respectively. $x_{max,j,g}$ and $x_{min,j,g}$ stand for the maximal and minimal values of the j-th variable in the current population at the g-th generation, that is, the coordinates of the two vertices of the hypercube spanned by the population.
It can be seen from Equation (6) that F is set according to the norm of the chosen difference vector and the distribution of the current population. The larger the norm of the chosen difference vector, the smaller the F used, to obtain higher search efficiency; conversely, when the norm is small, larger F values expand the global search scope to find better solutions. As is well known, the distribution range of the population becomes smaller and smaller as the search evolves, and the population gradually converges to a small area. The loss of diversity resulting from the rapid constriction of the population’s distribution range is the main reason why the population tends to stagnate prematurely around local optima. The proposed parameter adaptation for F can slow down the constriction of the distribution range and effectively maintain higher population diversity. Equation (7) makes the direction of the forthcoming search the same as that of the better of the two randomly selected individuals in the difference vector and opposite to that of the worse one. This improvement further increases the opportunity to find a better solution.
$$F_i = \begin{cases} F_i, & f(x_{r_1,g}) \le f(x_{r_2,g}) \\ -F_i, & f(x_{r_1,g}) > f(x_{r_2,g}) \end{cases} \qquad (7)$$
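A small sketch of this adaptation (Equations (6) and (7)) is given below; the F_min/F_max values and the minimization convention for f are assumptions for illustration.

```python
import numpy as np

def adaptive_F(pop, fit, r1, r2, F_min=0.1, F_max=0.9):
    diff = pop[r1] - pop[r2]
    box = pop.max(axis=0) - pop.min(axis=0)     # per-dimension population range
    ratio = np.sum(diff**2) / (np.sum(box**2) + 1e-12)
    F_i = F_min + (F_max - F_min) * (1.0 - 2.0 * ratio)
    F_i = max(F_i, F_min)                       # clamp from below (Eq. 6)
    # Eq. (7): search along the direction pointing toward the better individual
    return F_i if fit[r1] <= fit[r2] else -F_i
```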
The crossover probability CR determines the average percentage of components in the trial vector that are inherited from the mutation vector. Storn and Price [27] tried to find the first reasonable CR choice, guessing it would be 0.1, and found that a CR value of 0.9 may also be a good initial choice because a large CR value can speed up convergence. Ronkkonen et al. [28] suggested that the CR values should lie in [0, 0.2] when the function is separable, while they should lie in [0.9, 1] when the variables are dependent. However, Gämperle [29] recommended the interval [0.3, 0.9] as the effective range for the CR values. Even though there are various parameter settings for CR, it is agreed that small CR values for separable functions and large CR values for non-separable functions are much more suitable and effective; the CR values should be close to either 0.0 or 1.0 depending on the problem characteristics. Moreover, the requirements for the CR values are not constant over the whole search process: large CR values are desired to obtain fast convergence, and small CR values are beneficial to improve the precision of solutions. Since the CR values heavily depend on the search process and the problem to be solved, it is difficult to choose the most appropriate value in advance when the problem characteristics are unknown.
A self-learning parameter adaptation of CR is presented on the basis of the above analysis. Two intervals, R1 = [0.05, 0.15] and R2 = [0.85, 0.95], are defined as the range for CR. At each generation, the crossover probability $CR_i$ for each individual $x_i$ is independently generated according to a normal distribution with mean $\mu_{CR}$ and standard deviation $\sigma_{CR}$. The standard deviation $\sigma_{CR}$ is set to 0.025 in this paper, and the candidate values of $\mu_{CR}$ are 0.1 for R1 and 0.9 for R2. The selection of $\mu_{CR}$ relies on a probability $p_{R1}$, which stands for the probability that 0.1 serves as $\mu_{CR}$, that is, the probability that $CR_i$ is chosen from the first interval R1. The selected $\mu_{CR}$ is subsequently applied to generate a CR value for the corresponding target vector to form a trial vector.
It should be noted that $p_{R1}$ remains constant within one specific learning period (defined as T generations). The probability $p_{R1}$ is updated at the end of each learning period according to the successes in generating improved solutions within the current learning period; a trial is considered a success if the trial vector is better than its corresponding target vector and a failure otherwise. Obviously, the larger the success rate for CR values from the first interval R1 in the current learning period, the larger the probability $p_{R1}$ of applying that interval to generate trial vectors in the next learning period. The selective probability is initialized as $p_{R1} = 0.1$, and then $p_{R1}$ is updated as follows.
$$p_{R1} = \frac{\sum_{i=1}^{T} S_{i1} \Big/ \left(\sum_{i=1}^{T} U_{i1} + \varepsilon\right)}{\sum_{i=1}^{T} S_{i1} \Big/ \left(\sum_{i=1}^{T} U_{i1} + \varepsilon\right) + \sum_{i=1}^{T} S_{i2} \Big/ \left(\sum_{i=1}^{T} U_{i2} + \varepsilon\right)}$$
where $U_{i1}$ and $U_{i2}$ represent the numbers of trial vectors utilizing CR values chosen from R1 and R2, respectively, in the i-th generation of a learning period. $S_{i1}$ and $S_{i2}$ represent the numbers of those trial vectors generated with CR values from R1 and R2 that successfully enter the next generation. The small constant $\varepsilon = 0.01$ is used to avoid possible null success rates.
As explained above, the CR value for each target vector can be selected in the following manner.

$$CR_i = \begin{cases} N(0.1, 0.025) \in [0.05, 0.15], & \text{if } r < p_{R1} \\ N(0.9, 0.025) \in [0.85, 0.95], & \text{otherwise} \end{cases}$$

where $N(\mu, \sigma)$ denotes a normal distribution with mean $\mu$ and standard deviation $\sigma$. The random number drawn from the normal distribution must lie within its corresponding range; it is regenerated until the interval constraint is satisfied. r is a uniform random number between 0 and 1 generated for each target vector independently.
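The sampling of CR and the periodic update of $p_{R1}$ can be sketched as follows; the success/trial counters are accumulated over one learning period of T generations, and the bookkeeping structure is an illustrative assumption.

```python
import numpy as np

def sample_CR(p_R1, rng):
    if rng.random() < p_R1:
        lo, hi, mu = 0.05, 0.15, 0.1            # interval R1
    else:
        lo, hi, mu = 0.85, 0.95, 0.9            # interval R2
    cr = rng.normal(mu, 0.025)
    while not (lo <= cr <= hi):                 # regenerate until inside the range
        cr = rng.normal(mu, 0.025)
    return cr

def update_p_R1(S1, U1, S2, U2, eps=0.01):
    # S*/U* are successes / trials accumulated over one learning period (Eq. for p_R1)
    rate1 = S1 / (U1 + eps)
    rate2 = S2 / (U2 + eps)
    return rate1 / (rate1 + rate2)
```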
The adaptive adjustment schemes of F and C R proposed in this paper update the control parameters dynamically according to various problem characteristics and provide the best choices of parameters to make it more efficient and competitive as the evolution proceeds.

4. Process of Data Recovery

In this paper, we constructed an RCNN-BiGRU combined network based on a CNN and a GRU by taking the missing data as the output and the historical data of adjacent time periods as the input. At the same time, to address the difficulty of selecting the network architecture and training parameters, an accelerated adaptive differential evolution algorithm is proposed to find the best combination of parameters with minimal computational cost, so as to construct an RCNN-BiGRU network with better performance for the recovery of missing data.
The whole recovery process of missing data is shown in Figure 5.
In summary, the recovery method of power grid data proposed in this article consists of the following six parts.
  • Data Preprocessing: Prepare the electric load data by normalizing and dividing into train and test sets.
  • Network Construction: Build the RCNN-BiGRU neural network with an initial configuration of layers and neurons according to the model structural parameters.
  • Hyperparameter Optimization: Utilize the AADE algorithm to optimize key hyperparameters of the RCNN-BiGRU model, such as the number of neurons in each layer, and the initial learning rate.
    (a)
    Initialization: Generate an initial population of possible hyperparameter combinations.
    (b)
    Evaluation: Train the RCNN-BiGRU network using each set of hyperparameters and evaluate its performance.
    (c)
    Selection: Select the best-performing hyperparameter sets based on the evaluation.
    (d)
    Mutation and Crossover: Apply mutation and crossover operations to create a new population of hyperparameters.
    (e)
    Iteration: Repeat the evaluation, selection, mutation, and crossover steps until convergence or a predefined number of iterations is reached.
  • Model Training: Train the optimized RCNN-BiGRU model on the training set using the best hyperparameters obtained from the AADE optimization algorithm.
  • Model Evaluation: Evaluate the performance of the trained model on the test set to verify its effectiveness in electric load forecasting.
  • Deployment: Deploy the final RCNN-BiGRU model for the practical recovery of missed electric load data.
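A hedged sketch of the inner fitness evaluation used in step 3(b) is shown below: each AADE individual encodes (numGru, learnRate), and the fitness is the validation RMSE of an RCNN-BiGRU trained with those hyperparameters. It reuses the RCNNBiGRU class from the sketch in Section 2.3, and the optimizer settings (Adam, gradient clipping at 1, weight decay 0.001, 30 training epochs) mirror the choices reported in Section 5; none of this is the authors' MATLAB code.

```python
import torch
import torch.nn as nn

def candidate_fitness(candidate, x_train, y_train, x_val, y_val, epochs=30):
    # Targets are expected with shape (N, 1) to match the model output
    num_gru = int(round(candidate[0]))            # 1..30 nodes in the GRU layer
    learn_rate = float(candidate[1])              # 0.001..0.01 initial learning rate
    model = RCNNBiGRU(num_gru=num_gru)            # class from the Section 2.3 sketch
    opt = torch.optim.Adam(model.parameters(), lr=learn_rate, weight_decay=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x_train), y_train)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient threshold of 1
        opt.step()
    with torch.no_grad():
        rmse = torch.sqrt(loss_fn(model(x_val), y_val)).item()
    return rmse                                    # AADE minimizes this value
```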

5. Experiments and Analysis

In this section, the performance of the AADE-RCNN-BiGRU on electric load prediction is extensively investigated based on a large number of experimental studies. All experiments in this paper are implemented in Matlab R2024a using a PC with Intel i5-13400 CPU, 16 GB RAM and 64-bit Windows 10 operating system.
Actual electric load data were selected to train the model; the data are sampled at 15 min intervals, i.e., 96 points per day over 365 days a year. A data set is constructed with the data from 25 consecutive time points as input and the data from the next time point as output. A portion of the data is extracted, of which 80% is used as training data and 20% as testing data.
It should be noted that the training data and validation data were normalized separately during data preprocessing. The normalization uses the max–min linear normalization mapminmax, in which all data are mapped to [0, 1].
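As a concrete illustration of this preprocessing, the sketch below slices a load series into 25-point inputs with the following point as the target, normalizes the training and test portions separately to [0, 1], and applies the 80/20 split; the synthetic sine series is only a placeholder for the real grid data.

```python
import numpy as np

def make_windows(series, window=25):
    # Min-max normalization to [0, 1] (mapminmax-style), then sliding-window samples
    series = np.asarray(series, dtype=float)
    series = (series - series.min()) / (series.max() - series.min())
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]                      # the next point, i.e., the value to recover
    return X, y

series = np.sin(np.linspace(0, 40 * np.pi, 96 * 30))    # placeholder: 30 days of 96 points
split = int(0.8 * len(series))                           # 80% train / 20% test
train_X, train_y = make_windows(series[:split])          # train and test normalized separately
test_X, test_y = make_windows(series[split:])
print(train_X.shape, test_X.shape)                       # (2279, 25) (551, 25)
```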
During the model training process, the weight coefficients are optimized using the Adam algorithm with the gradient threshold set to 1 and the regularization parameter set to 0.001.

5.1. Performance Comparison of Different Networks

In order to verify the effectiveness of CNNs, the residual structure and the bidirectional structure introduced in the composite model, different neural network models were constructed, and the same model parameters were used to compare the prediction effects of the training set and the test set.
The meanings and abbreviations of these networks are given as follows:
  • GRU: Single unidirectional GRU network;
  • BiGRU: Single bidirectional GRU network;
  • CNN: Single CNN network without residual structure;
  • RCNN: Single CNN network with residual structure;
  • CNN-GRU: CNN without residual structure + unidirectional GRU network;
  • CNN-BiGRU: CNN without residual structure + bidirectional GRU network;
  • RCNN-GRU: CNN with residual structure + unidirectional GRU network;
  • RCNN-BiGRU: CNN with residual structure + bidirectional GRU network.
Here, the number of convolutional kernels is 16, and the window width is 3 in the convolutional layer of these networks. Six regression indexes, MAE, MAPE, MSE, RMSE, R2, and adjusted R2 (AR2), are selected to compare network performance; smaller values of MAE, MAPE, MSE, and RMSE are better, and larger values of R2 and AR2 are better. The same training set is used to train each network, and they are also validated on the same test set. The experimental results are shown in Table 1 and Table 2.
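For reference, the six indexes can be computed as follows; the feature count p = 25 used for the adjusted R2 is an assumption, since the paper does not state which value it uses.

```python
import numpy as np

def regression_metrics(y_true, y_pred, p=25):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    n = len(y_true)
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / y_true)) * 100.0
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    ar2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)   # adjusted R2
    return {"MAE": mae, "MAPE": mape, "MSE": mse, "RMSE": rmse, "R2": r2, "AR2": ar2}
```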
As can be seen from Table 1 and Table 2, the RCNN-BiGRU composite network proposed in this paper outperforms the other network forms in most of the metrics on both the training and test sets. This can be seen more clearly in Figure 6 and Figure 7. Note: the MAE, MAPE, MSE, and RMSE metrics are normalized, and R2 and AR2 are plotted as 1 − R2 and 1 − AR2, respectively.
In order to analyze the impact of CNNs on network performance, the GRU is compared with the CNN-GRU and RCNN-GRU, and the BiGRU is compared with the CNN-BiGRU and RCNN-BiGRU; the results are shown in Figure 8. From the figure, it can be seen that the RMSE metrics of both the GRU and BiGRU networks decrease after the addition of a CNN or RCNN, which indicates that adding a CNN is effective in improving network performance.
In order to compare the effects of a unidirectional GRU and a bidirectional GRU on network performance, the GRU is compared with the BiGRU, the CNN-GRU with the CNN-BiGRU, and the RCNN-GRU with the RCNN-BiGRU; the results are shown in Figure 9. As shown in the figure, the bidirectional structure yields lower RMSE metrics and improves the performance of the network compared to the unidirectional GRUs, which suggests that the application of the bidirectional structure is justified.
To verify the effectiveness of the residual structure for CNN networks, the CNN is compared with the RCNN, the CNN-GRU with the RCNN-GRU, and the CNN-BiGRU with the RCNN-BiGRU. The results are shown in Figure 10. As shown in the figure, the introduction of the residual structure in the CNN reduces the RMSE values, which shows that employing the residual structure is beneficial to improving network performance.
In summary, the introduction of a CNN in front of a GRU network can better extract the local dependence and periodic patterns of the data and provide a powerful feature representation for the subsequent prediction. The use of a bidirectional structure enables the GRU network to capture the forward and backward information more comprehensively. Meanwhile, the residual structure can alleviate the problem of gradient vanishing in the deeper network and promote the network to converge faster, which improves the training efficiency and performance.

5.2. Analysis and Setting of Hyperparameters

For neural networks, the fitting ability, training speed and prediction performance vary widely with model size, and the corresponding parameters often need to be chosen based on human experience, which makes them difficult to determine. Meanwhile, the training parameters have a large impact on the final training results, and the efficiency of model training largely depends on the selection of these parameters. Therefore, choosing appropriate model structure parameters and training parameters is the key to neural network model construction.
In this paper, the newly proposed AADE algorithm is used to optimize the network model. To make the optimization targeted, we first identify the relevant model structure parameters and training parameters. The optimization ranges of the parameters in the RCNN-BiGRU model are determined after analyzing their impact on network performance.
The main model structural parameters of the RCNN-BiGRU model include the number of nodes in the GRU layer and the number of convolutional kernels in the CNN layer. The number of convolution kernels in the CNN layer is usually taken as a multiple of 16, such as 16, 32, or 64, and it is generally smaller than the dimension of the input feature. The input dimension of the training samples is 25 in this paper, so the number of convolutional kernels is set to 16 and the window width to 3 in the convolutional layers of these networks. However, the number of nodes in the GRU layer cannot be determined empirically, so it is used as a one-dimensional optimization variable.
During model training, the maximum number of training epochs and the initial learning rate are the two parameters that most strongly determine network performance. Although more training epochs help the model fit the training data better, excessive training may lead to overfitting; that is, the model performs well on the training data but poorly on unseen data. Overfitting limits the generalization ability of the model and makes it ineffective in practical problems. A learning rate that is too small may lead to a long, potentially stalled training process, while one that is too large may cause suboptimal weights to be learned or make the training process unstable.
The proposed AADE-RCNN-BiGRU method has three key parameters to set: the number of nodes in the GRU layer (numGru), the maximum number of training epochs (maxEpoch) and the initial learning rate (learnRate). Their influence on performance is investigated through the Taguchi method using the data set described above, and the statistical analysis is carried out with the Minitab tool [30]. The number of factor levels is set to 5 for each parameter, that is, numGru = {5, 20, 50, 100, 200}, maxEpoch = {10, 50, 100, 200, 500} and learnRate = {0.001, 0.003, 0.005, 0.01, 0.02}. The orthogonal array L25(5^6) is chosen. The RMSE values obtained on the training set are recorded as the average response variable (ARV) values. The significance of the parameters is reported in Table 3, and the trend of each factor level is illustrated in Figure 11.
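As an illustration of how the factor-level responses in Table 3 and the factor-level trend figures can be obtained, the sketch below averages the recorded RMSE over all runs that share a given level of a factor and ranks the factors by the resulting Delta (maximum minus minimum of the level means). The run results here are random placeholders, and a simple Latin-square construction stands in for the L25 orthogonal array; only the averaging and ranking logic is the point.

```python
import numpy as np

# Factor levels investigated in Section 5.2.
levels = {
    "numGru":    [5, 20, 50, 100, 200],
    "maxEpoch":  [10, 50, 100, 200, 500],
    "learnRate": [0.001, 0.003, 0.005, 0.01, 0.02],
}

rng = np.random.default_rng(0)
# One row per experiment: (level index of numGru, maxEpoch, learnRate, observed RMSE).
# The level indices follow a 5x5 Latin-square construction (a stand-in for the L25 array);
# the RMSE values are random placeholders instead of real training results.
runs = [(i // 5, i % 5, (i // 5 + i % 5) % 5, rng.uniform(0.5, 3.0)) for i in range(25)]

def average_response(factor_index: int) -> list:
    """Mean RMSE over all runs that used a given level of one factor."""
    return [float(np.mean([r[3] for r in runs if r[factor_index] == lvl]))
            for lvl in range(5)]

for idx, name in enumerate(levels):
    arv = average_response(idx)
    delta = max(arv) - min(arv)          # a larger Delta means a more influential factor
    print(name, [round(v, 3) for v in arv], "Delta =", round(delta, 3))
```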
It can be clearly seen from Table 3 that maxEpoch has the most significant influence on the performance of the AADE-RCNN-BiGRU and numGru has the least. A similar statistical analysis is conducted on the test set, and the trend of each factor level is illustrated in Figure 12.
As can be seen from these figures, network performance degrades when the number of nodes in the GRU layer is either too large or too small, with values in the range of 10 to 25 being optimal. As the maximum number of training epochs increases, the RMSE gradually decreases; beyond 100 epochs, further increases bring no obvious improvement, which means that learning is essentially complete at that point. As a result, the maximum number of training epochs can be set to a fixed value and is no longer used as an optimization variable. When the learning rate is below 0.001 or above 0.01, network performance deteriorates sharply, so the optimal learning rate lies between 0.001 and 0.01.

5.3. Comparison of AADE with Other Optimization Algorithms

Following the analysis in the previous section, the number of nodes in the GRU layer and the initial learning rate are the two parameters with the greatest impact on the performance of the RCNN-BiGRU model, so they are taken as the variables to be optimized by the AADE algorithm. The optimization range of the number of nodes in the GRU layer is set from 1 to 30, and that of the initial learning rate from 0.001 to 0.01.
As noted in Section 5.2, network performance improves as the number of training epochs increases, with diminishing returns beyond a certain point. Consequently, this parameter does not need to be optimized and can be set according to the computational resources available in practice. To reduce computational cost, this paper sets the number of training epochs to 30, although a higher value can be used when training the model in practice.
Four parameters need to be set in the classical DE algorithm: the population size (NP), the mutation factor (F), the crossover probability (CR) and the total number of generations (nIter). AADE introduces two new adaptive mechanisms that dynamically adjust F and CR, so only NP and nIter require manual setting. For the AADE algorithm used in this paper, NP = 10 and nIter = 30, which are the same settings used for all the comparison algorithms.
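For readers unfamiliar with DE, the sketch below shows a classical DE/rand/1/bin loop over the two optimization variables (numGru in [1, 30] and learnRate in [0.001, 0.01]) with NP = 10 and nIter = 30. It deliberately does not reproduce the accelerated mutation operator or the adaptive F/CR strategies of AADE, and `train_and_eval_rmse` is a dummy surrogate standing in for training the RCNN-BiGRU with the candidate hyperparameters and returning its RMSE.

```python
import numpy as np

rng = np.random.default_rng(42)

LOWER = np.array([1.0, 0.001])     # bounds for numGru and learnRate
UPPER = np.array([30.0, 0.01])
NP, N_ITER = 10, 30                # population size and generations, as in Section 5.3
F, CR = 0.9, 0.6                   # fixed classical-DE values (Table 4); AADE adapts them

def train_and_eval_rmse(x):
    """Placeholder fitness: in the paper this trains the RCNN-BiGRU with
    numGru = round(x[0]) and learnRate = x[1] and returns the RMSE."""
    num_gru, lr = round(x[0]), x[1]
    return (num_gru - 17) ** 2 * 1e-3 + (lr - 0.004) ** 2 * 1e2   # dummy surrogate

pop = LOWER + rng.random((NP, 2)) * (UPPER - LOWER)
fit = np.array([train_and_eval_rmse(ind) for ind in pop])

for _ in range(N_ITER):
    for i in range(NP):
        r1, r2, r3 = rng.choice([j for j in range(NP) if j != i], size=3, replace=False)
        mutant = np.clip(pop[r1] + F * (pop[r2] - pop[r3]), LOWER, UPPER)   # DE/rand/1 mutation
        cross = rng.random(2) < CR
        cross[rng.integers(2)] = True                  # binomial crossover: keep at least one gene
        trial = np.where(cross, mutant, pop[i])
        f_trial = train_and_eval_rmse(trial)
        if f_trial <= fit[i]:                          # greedy one-to-one selection
            pop[i], fit[i] = trial, f_trial

best = pop[np.argmin(fit)]
print("best numGru =", round(best[0]), " best learnRate =", round(best[1], 4))
```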
Various algorithms in the literature [26,27,28,29,31] are selected for comparison with the AADE algorithm proposed in this paper, and the parameters are set according to the original literature. Other parameters utilized in our experiments are given in Table 4. Twenty-five independent experiments are conducted, and the corresponding metrics are calculated for the training set and test set, respectively. The results are shown in Table 5 and Table 6.
In Table 5, the rank of each algorithm on each metric is given in brackets. Based on the average rank over all metrics, the JADE algorithm performs best on the training set, followed by the DE and AADE algorithms; next come SaDE, OXDE and CoBiDE, and ACoDE is the least effective. Specifically, the DE algorithm is best on four metrics (MSE, RMSE, R² and AR²), while the JADE algorithm performs best on the remaining two. The AADE algorithm proposed in this paper obtains suboptimal results on two metrics and ranks third on the other four. From these results, it can be concluded that, although AADE is not the best on the training set, it is still among the leading algorithms.
The picture changes on the test set, where the AADE algorithm clearly outperforms the other algorithms. As can be seen from Table 6, AADE ranks first on all six metrics and therefore first overall. The JADE algorithm, which performed best on the training set, drops to third place behind the AADE and SaDE algorithms. Meanwhile, the performance of the DE algorithm deteriorates on the test set, and the otherwise less effective OXDE, CoBiDE and ACoDE remain poor.
In summary, although the AADE algorithm does not yield the best fit on the training samples, the trained model performs outstandingly on the new test set. Therefore, the improved algorithm proposed in this paper is well suited to finding the optimal hyperparameters of the network model. This can be seen more clearly in Figure 13 and Figure 14. Note: R² and AR² are first converted to 1 − R² and 1 − AR², respectively, and all metrics are then normalized.
Wilcoxon rank-sum tests at the 5% significance level are conducted to compare the proposed AADE method with the other algorithms in a statistically rigorous way. Table 7 and Table 8 report the results of the rank-sum tests for the training set and the test set, respectively. In these two tables, "1" and "−1" indicate that AADE performs better and worse, respectively, than the corresponding algorithm at the 95% confidence level, while "0" indicates that the two compared algorithms perform similarly. The last three rows summarize the number of times that AADE performs better than ("1"), similarly to ("0") and worse than ("−1") each comparison algorithm over the six metrics.
According to Table 7, AADE consistently outperforms ACoDE on every metric but is inferior to DE and JADE on MSE. Statistically, there is no difference between the performance of AADE and that of SaDE, and on most of the remaining comparisons AADE shows no significant advantage over the other algorithms. Turning to Table 8, AADE exhibits better capability than DE, CoBiDE and OXDE and performs similarly to SaDE on the test set. AADE performs better than JADE and ACoDE on most of the metrics and never performs worse than them. Overall, it can be concluded that AADE performs better than, or at least comparably to, the other DE variants, which is consistent with the previous observations.
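The 1/0/−1 entries in Tables 7 and 8 can be reproduced with a standard two-sided rank-sum test over the 25 independent runs, as sketched below. The per-run metric values are random placeholders here; for R² and AR², where larger is better, the win/loss direction is simply reversed.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(1)

# Placeholder RMSE values from 25 independent runs of AADE and one comparison algorithm.
aade_rmse  = rng.normal(1.64, 0.05, 25)
other_rmse = rng.normal(1.73, 0.06, 25)

stat, p_value = ranksums(aade_rmse, other_rmse)

if p_value >= 0.05:
    verdict = 0                                          # no significant difference
else:
    # RMSE is an error metric, so lower is better; for R2/AR2 the comparison flips.
    verdict = 1 if np.median(aade_rmse) < np.median(other_rmse) else -1

print(f"p = {p_value:.4f}, verdict = {verdict:+d}")
```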
In addition, considering the scale of the RCNN-BiGRU network, even when the number of nodes in the GRU layer is set to its maximum value of 30, the total number of parameters in the entire network is only about 11,000. The optimized neural network model in this paper is therefore a very small-scale model, suitable for building time-series models of electricity consumption for a large number of customers. Moreover, the AADE algorithm proposed in this paper can quickly find a well-performing network model at a moderate computational cost, eliminating the burden of manual parameter tuning.
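As a rough sanity check on that figure, the snippet below counts parameters for the configuration assumed in the earlier sketch (16 convolution kernels of width 3, a bidirectional GRU with hidden size 30 and a linear output head). The exact architecture details are not reported by the authors, so this is only an order-of-magnitude estimate, and it indeed lands below roughly 11,000.

```python
# Back-of-envelope parameter count for the assumed RCNN-BiGRU configuration.
hidden, channels, k = 30, 16, 3

conv     = channels * 1 * k + channels                      # Conv1d(1, 16, 3): weights + biases
shortcut = channels * 1 * 1 + channels                      # 1x1 convolution on the residual path
gru_dir  = 3 * (hidden * channels + hidden * hidden + 2 * hidden)  # one GRU direction (3 gates)
bigru    = 2 * gru_dir                                      # forward + backward directions
head     = 2 * hidden + 1                                   # Linear(60, 1)

print(conv + shortcut + bigru + head)                       # 8797 -- same order as ~11,000
```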

6. Conclusions

This paper tackles the critical challenge of missing electric energy data by introducing a novel data recovery model, RCNN-BiGRU, which integrates a CNN with a bidirectional GRU. The CNN with a residual structure captures local dependencies and periodic patterns, while the BiGRU models the temporal relationships of the data. To optimize this model, an accelerated adaptive differential evolution (AADE) algorithm is introduced. The algorithm features an accelerated mutation operator and an adaptive strategy for parameter setting, improving the efficiency and effectiveness of the optimization process.
The RCNN-BiGRU model is trained with actual power grid data, and the results demonstrate its superiority over standalone CNN and GRU networks. Furthermore, comparative experiments with other optimization algorithms indicate that the AADE algorithm significantly enhances data recovery performance on both the training and test sets, making it a robust solution for accurate electric energy data reconstruction.
This paper provides grid operators with a feasible, cost-effective and more accurate solution for the recovery of missing data. This is beneficial for implementing lean management practices in power grid enterprises, including concurrent line loss analysis, supply reliability assessment, and real-time equipment fault monitoring. Additionally, it offers greater assurance of data reliability for power billing, customer service, and government decision making.
Meanwhile, it should be noted that the research on the practical data recovery in this paper can be extended in a few directions.
CNN and GRU networks are combined here to model the time series of electricity consumption. Future research will examine whether neural network architectures with smaller scale and stronger fitting performance exist. Developing more powerful optimization algorithms is also a broad avenue for future work: the proposed algorithm can still be improved in several aspects, and designing more effective difference-based mutation operators is an interesting direction. In particular, the theoretical analysis of the evolutionary behavior is arduous but worth exploring.
The training of neural networks is a complex process with numerous training parameters whose effects on the training results are difficult to predict, so the optimal selection range for each parameter is sometimes difficult to determine. This significantly increases the difficulty of incorporating them into the optimization problems and limits the potential of these training parameters as independent optimization variables.
The data recovery method proposed in this paper is an attempt at the hyperparameter optimization of neural networks in which a network structure parameter (the number of GRU nodes) and a training parameter (the learning rate) are treated as the independent variables to be optimized. In future research, more hyperparameters can be included among the optimization variables, and their corresponding optimization intervals can be determined through theoretical analysis and practical experience, allowing better network models to be found over a broader range.

Author Contributions

Methodology, Y.X. and Y.D.; software, Z.X. and X.K.; validation, Y.X., Y.D., C.L. and Z.X.; formal analysis, Y.X., Y.D., C.L. and Z.X.; writing—original draft preparation, Y.X. and X.K.; writing—review and editing, X.K.; project administration, Y.X. and X.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

Authors Yukun Xu, Yuwei Duan, Chang Liu and Zihan Xu were employed by the State Grid Shanghai Municipal Electric Power Company. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MAE	Mean Absolute Error
MAPE	Mean Absolute Percentage Error
MSE	Mean Square Error
RMSE	Root Mean Square Error
R²	R-squared
AR²	Adjusted R-squared

References

  1. Celebi, E.; Fuller, J.D. Time-of-use pricing in electricity markets under different market structures. IEEE Trans. Power Syst. 2012, 27, 1170–1181. [Google Scholar] [CrossRef]
  2. Hung, Y.C.; Michailidis, G. Modeling and optimization of time-of-use electricity pricing systems. IEEE Trans. Smart Grid 2018, 10, 4116–4127. [Google Scholar] [CrossRef]
  3. Kim, W.; Koo, J.; Jeong, J. Fine directional interpolation for spatial error concealment. IEEE Trans. Consum. Electron. 2006, 52, 1050–1056. [Google Scholar]
  4. Deng, W.; Guo, Y.; Liu, J.; Li, Y.; Liu, D.; Zhu, L. A missing power data filling method based on improved random forest algorithm. Chin. J. Electr. Eng. 2019, 5, 33–39. [Google Scholar] [CrossRef]
  5. Ding, Z.; Mei, G.; Cuomo, S.; Li, Y.; Xu, N. Comparison of estimating missing values in iot time series data using different interpolation algorithms. Int. J. Parallel Program. 2020, 48, 534–548. [Google Scholar] [CrossRef]
  6. Mo, T.; Sun, X.; Wang, B. Self-supervised seismic data interpolation via frequency extrapolation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5914509. [Google Scholar] [CrossRef]
  7. Yang, H.; Cao, J.; Chen, X. Interpolation of irregularly sampled seismic data via non-convex regularization. J. Appl. Geophys. 2023, 215, 105073. [Google Scholar] [CrossRef]
  8. Pan, L.; Li, J. K-nearest neighbor based missing data estimation algorithm in wireless sensor networks. Wirel. Sens. Netw. 2010, 2, 115–122. [Google Scholar] [CrossRef]
  9. Paik, J.W.; Hong, W.; Lee, J.H. Direction-of-departure and direction-of-arrival estimation algorithm based on compressive sensing: Data fitting. Remote Sens. 2020, 12, 2773. [Google Scholar] [CrossRef]
  10. Zhong, Y.; Li, C.; Li, Z.; Duan, X. A proximal-based algorithm for piecewise sparse approximation with application to scattered data fitting. Int. J. Appl. Math. Comput. Sci. 2022, 32, 671–682. [Google Scholar] [CrossRef]
  11. James, J.Q.; Hill, D.J.; Li, V.O.; Hou, Y. Synchrophasor recovery and prediction: A graph-based deep learning approach. IEEE Internet Things J. 2019, 6, 7348–7359. [Google Scholar]
  12. Chai, X.; Gu, H.; Li, F.; Duan, H.; Hu, X.; Lin, K. Deep learning for irregularly and regularly missing data reconstruction. Sci. Rep. 2020, 10, 3302. [Google Scholar] [CrossRef]
  13. Wu, H.; Li, S.; Liu, N. Seismic interpolation via multi-scale HU-Net. Geoenergy Sci. Eng. 2023, 222, 211458. [Google Scholar] [CrossRef]
  14. Yadav, H.; Thakkar, A. NOA-LSTM: An efficient LSTM cell architecture for time series forecasting. Expert Syst. Appl. 2024, 238, 122333. [Google Scholar] [CrossRef]
  15. Dao, F.; Zeng, Y.; Qian, J. Fault diagnosis of hydro-turbine via the incorporation of bayesian algorithm optimized CNN-LSTM neural network. Energy 2024, 290, 130326. [Google Scholar] [CrossRef]
  16. Tu, B.; Bai, K.; Zhan, C.; Zhang, W. Real-time prediction of ROP based on GRU-Informer. Sci. Rep. 2024, 14, 2133. [Google Scholar] [CrossRef]
  17. Akilan, T.; Baalamurugan, K.M. Automated weather forecasting and field monitoring using GRU-CNN model along with IoT to support precision agriculture. Expert Syst. Appl. 2024, 249, 123468. [Google Scholar] [CrossRef]
  18. El-Assy, A.M.; Amer, H.M.; Ibrahim, H.M.; Mohamed, M.A. A novel CNN architecture for accurate early detection and classification of Alzheimer’s disease using MRI data. Sci. Rep. 2024, 14, 3463. [Google Scholar] [CrossRef] [PubMed]
  19. Song, B.; Liu, Y.; Fang, J.; Liu, W.; Zhong, M.; Liu, X. An optimized CNN-BiLSTM network for bearing fault diagnosis under multiple working conditions with limited training samples. Neurocomputing 2024, 574, 127284. [Google Scholar] [CrossRef]
  20. Storn, R.; Price, K. Differential evolution—A simple and efficient heuristic for global optimization over continuous spaces. J. Glob. Optim. 1997, 11, 341–359. [Google Scholar] [CrossRef]
  21. Yang, Q.; Qiao, Z.-Y.; Xu, P.; Lin, X.; Gao, X.-D.; Wang, Z.-J.; Lu, Z.-Y.; Jeon, S.-W.; Zhang, J. Triple competitive differential evolution for global numerical optimization. Swarm Evol. Comput. 2024, 84, 101450. [Google Scholar] [CrossRef]
  22. Sui, Q.; Yu, Y.; Wang, K.; Zhong, L.; Lei, Z.; Gao, S. Best-worst individuals driven multiple-layered differential evolution. Inf. Sci. 2024, 655, 119889. [Google Scholar] [CrossRef]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  24. He, F.; Liu, T.; Tao, D. Why resnet works? residuals generalize. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 5349–5362. [Google Scholar] [CrossRef]
  25. Price, K.V.; Storn, R.M.; Lampinen, J.A. Differential Evolution: A Practical Approach to Global Optimization; Natural Computing Series; Springer: Berlin/Heidelberg, Germany, 2005. [Google Scholar]
  26. Zhang, J.; Sanderson, A.C. JADE: Adaptive differential evolution with optional external archive. IEEE Trans. Evol. Comput. 2009, 13, 945–958. [Google Scholar] [CrossRef]
  27. Pan, Q.K.; Suganthan, P.N.; Wang, L.; Gao, L.; Mallipeddi, R. A differential evolution algorithm with self-adapting strategy and control parameters. Comput. Oper. Res. 2011, 38, 394–408. [Google Scholar] [CrossRef]
  28. Wang, Y.; Cai, Z.; Zhang, Q. Enhancing the search ability of differential evolution through orthogonal crossover. Inf. Sci. 2012, 185, 153–177. [Google Scholar] [CrossRef]
  29. Wang, Y.; Li, H.X.; Huang, T.; Li, L. Differential evolution based on covariance matrix learning and bimodal distribution parameter setting. Appl. Soft Comput. 2014, 18, 232–247. [Google Scholar] [CrossRef]
  30. El-Aswad, A.F.; Mohamed, A.E.; Fouad, M.R. Investigation of dissipation kinetics and half-lives of fipronil and thiamethoxam in soil under various conditions using experimental modeling design by Minitab software. Sci. Rep. 2024, 14, 5717. [Google Scholar] [CrossRef]
  31. Wang, B.C.; Li, H.X.; Li, J.P.; Wang, Y. Composite differential evolution for constrained evolutionary optimization. IEEE Trans. Syst. Man Cybern. Syst. 2018, 49, 1482–1495. [Google Scholar] [CrossRef]
Figure 1. Network architecture of basic GRU.
Figure 2. Network architecture of bidirectional GRU.
Figure 3. Residual structure.
Figure 4. RCNN-BiGRU composite network model.
Figure 5. Flowchart of AADE-RCNN-BiGRU.
Figure 6. Comparison of metrics on the training set.
Figure 7. Comparison of metrics on the test set.
Figure 8. Comparisons between models with CNN and those without CNN.
Figure 9. Comparisons between models with unidirectional GRU and those with bidirectional GRU.
Figure 10. Comparisons between CNN models with residual structure and those without.
Figure 11. Factor level trend of AADE-RCNN-BiGRU on training set.
Figure 12. Factor level trend of AADE-RCNN-BiGRU on test set.
Figure 13. Comparison of metrics on the training set.
Figure 14. Comparison of metrics on the test set.
Table 1. Comparison of the training set metrics.
Model | MAE | MAPE | MSE | RMSE | R² | AR²
GRU | 4.221 | 0.074 | 28.243 | 5.183 | 0.717 | 0.717
BiGRU | 3.160 | 0.056 | 16.794 | 3.952 | 0.832 | 0.832
CNN | 3.501 | 0.062 | 23.870 | 4.290 | 0.761 | 0.761
RCNN | 3.351 | 0.060 | 18.616 | 4.186 | 0.814 | 0.814
CNN-GRU | 3.336 | 0.059 | 18.628 | 4.226 | 0.813 | 0.813
CNN-BiGRU | 2.702 | 0.048 | 12.565 | 3.463 | 0.874 | 0.874
RCNN-GRU | 2.357 | 0.042 | 9.305 | 3.017 | 0.907 | 0.907
RCNN-BiGRU | 2.205 | 0.039 | 8.086 | 2.819 | 0.919 | 0.919
Table 2. Comparison of the test set metrics.
Model | MAE | MAPE | MSE | RMSE | R² | AR²
GRU | 4.612 | 0.1 | 35.353 | 5.75 | 0.511 | 0.51
BiGRU | 3.648 | 0.079 | 22.324 | 4.571 | 0.691 | 0.691
CNN | 4.031 | 0.087 | 35.599 | 4.952 | 0.508 | 0.507
RCNN | 3.82 | 0.082 | 24.008 | 4.771 | 0.668 | 0.668
CNN-GRU | 4.197 | 0.092 | 29.004 | 5.229 | 0.599 | 0.598
CNN-BiGRU | 3.428 | 0.074 | 19.979 | 4.359 | 0.724 | 0.723
RCNN-GRU | 3.033 | 0.067 | 14.779 | 3.757 | 0.796 | 0.795
RCNN-BiGRU | 2.704 | 0.059 | 11.805 | 3.387 | 0.837 | 0.837
Table 3. Significance of the parameters.
Level | numGru | maxEpoch | learnRate
1 | 1.8031 | 5.2914 | 2.0081
2 | 1.008 | 1.3312 | 1.0898
3 | 1.306 | 0.6883 | 1.161
4 | 1.5855 | 0.5143 | 1.3494
5 | 2.5839 | 0.4612 | 2.6783
Delta | 1.5758 | 4.8302 | 1.5884
Rank | 3 | 1 | 2
Table 4. Parameter setting for the DE variants.
Variant | Parameter Setting
AADE | T = 5; pR1 = 0.1
DE | F = 0.9; CR = 0.6
JADE | c = 0.1; p = 0.05; CRm = 0.5; Fm = 0.5; Afactor = 1
SaDE | numst = 4; learngen = 50
ACoDE | LEP = 50
CoBiDE | pb = 0.4; ps = 0.5
OXDE | Q = 3; J = 2; F = 0.9; CR = 0.9
Table 5. Regression performance metrics for the training set.
Algorithm | MAE | MAPE | MSE | RMSE | R² | AR² | Mean Rank
AADE | 0.9531 (2) | 0.0169 (2) | 1.5661 (3) | 1.2451 (3) | 0.9843 (3) | 0.9843 (3) | 2.67
DE | 0.9538 (4) | 0.0171 (4) | 1.5482 (1) | 1.2383 (1) | 0.9845 (1) | 0.9845 (1) | 2
JADE | 0.9494 (1) | 0.0169 (1) | 1.5564 (2) | 1.2383 (2) | 0.9844 (2) | 0.9844 (2) | 1.67
SaDE | 0.9535 (3) | 0.0169 (3) | 1.5698 (4) | 1.2451 (4) | 0.9843 (4) | 0.9843 (4) | 3.67
ACoDE | 1.0088 (7) | 0.0180 (7) | 1.7275 (7) | 1.2983 (7) | 0.9827 (7) | 0.9827 (7) | 7
CoBiDE | 0.9625 (6) | 0.0172 (6) | 1.5872 (6) | 1.2489 (6) | 0.9841 (6) | 0.9841 (6) | 6
OXDE | 0.9594 (5) | 0.0172 (5) | 1.5814 (5) | 1.2478 (5) | 0.9842 (5) | 0.9842 (5) | 5
Table 6. Regression performance metrics for the test set.
Algorithm | MAE | MAPE | MSE | RMSE | R² | AR² | Mean Rank
AADE | 1.2645 (1) | 0.0266 (1) | 2.7275 (1) | 1.6417 (1) | 0.9623 (1) | 0.9622 (1) | 1
DE | 1.3116 (4) | 0.0279 (6) | 3.0233 (5) | 1.7317 (5) | 0.9582 (5) | 0.9581 (5) | 5
JADE | 1.2668 (2) | 0.0267 (2) | 2.8792 (3) | 1.6866 (4) | 0.9602 (3) | 0.9601 (3) | 2.83
SaDE | 1.2853 (3) | 0.0273 (3) | 2.7963 (2) | 1.6628 (2) | 0.9613 (2) | 0.9613 (2) | 2.33
ACoDE | 1.3128 (5) | 0.0279 (4) | 2.9087 (4) | 1.6860 (3) | 0.9598 (4) | 0.9597 (4) | 4
CoBiDE | 1.3189 (6) | 0.0281 (7) | 3.0833 (6) | 1.7427 (6) | 0.9574 (6) | 0.9573 (6) | 6.17
OXDE | 1.3224 (7) | 0.0279 (5) | 3.0877 (7) | 1.7434 (7) | 0.9573 (7) | 0.9572 (7) | 6.67
Table 7. Results of rank-sum tests for AADE with other DE variants on training set.
AADE vs. | DE | JADE | SaDE | ACoDE | CoBiDE | OXDE
MAE | 0 | 0 | 0 | 1 | 1 | 0
MAPE | 0 | 0 | 0 | 1 | 0 | 0
MSE | −1 | −1 | 0 | 1 | 1 | 1
RMSE | 1 | 1 | 0 | 1 | 0 | 0
R² | 0 | 0 | 0 | 1 | 0 | 0
AR² | 0 | 0 | 0 | 1 | 0 | 0
+ | 1 | 1 | 0 | 6 | 2 | 1
= | 4 | 4 | 6 | 0 | 4 | 5
− | 1 | 1 | 0 | 0 | 0 | 0
Table 8. Results of rank-sum tests for AADE with other DE variants on test set.
AADE vs. | DE | JADE | SaDE | ACoDE | CoBiDE | OXDE
MAE | 1 | 0 | 0 | 1 | 1 | 1
MAPE | 1 | 0 | 0 | 1 | 1 | 1
MSE | 1 | 1 | 0 | 1 | 1 | 1
RMSE | 1 | 1 | 0 | 0 | 1 | 1
R² | 1 | 1 | 0 | 1 | 1 | 1
AR² | 1 | 1 | 0 | 1 | 1 | 1
+ | 6 | 4 | 0 | 5 | 6 | 6
= | 0 | 2 | 6 | 1 | 0 | 0
− | 0 | 0 | 0 | 0 | 0 | 0
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
