1. Introduction
With the proposal of China's strategic goal of "carbon peaking and carbon neutrality" and the large-scale integration of new energy power generation, the traditional power production model has been disrupted, and the planning, management, and dispatch of power production have become increasingly complex [1]. Power grid data are characterized by multisource heterogeneity, scattered distribution, large scale, rapid change, and many types [2]. The acquisition of distribution data is the basis for analyzing distribution network operation [3]. The use of feeder terminal units (FTUs), distribution terminal units (DTUs), and, in recent years, the emerging transformer terminal units (TTUs) has enabled the orderly collection of multistate measurement data in the distribution network. At the same time, effective cleaning of these measurement data can provide high-quality data sources for multitype protection, transparent fault isolation, and online safety checking of the distribution network [4,5]. Therefore, cleaning abnormal data is of great importance in practical engineering applications [1].
Data cleaning is a method used to detect and eliminate errors and inconsistencies in data [6]. The power grid can be simplified into a physical topology composed of power generation nodes and transmission networks, and a strong correlation exists between nodes and lines through their complex physical connections. At the same time, owing to equipment and system inertia, power system operation data exhibit temporal correlation over a continuous period [7]. Current research on the identification and processing of abnormal data in power systems can be divided into three categories according to the temporal and spatial correlation analysis of the measured data. The first category cleans the data according to the temporal correlation of the measured data; for example, a wavelet neural network can filter the sampled signal of a DC system, organically combining the time–frequency localization of the wavelet transform with the self-learning and adaptive nature of the neural network. Literature [8] proposed a method that converts the detection of similar and duplicate records of text data into the detection of similar and duplicate records of their binary strings; it provided mutual cleaning of daily load data for the distribution network by exploiting the similarity of the daily cycle. Literature [9] proposed a power grid data cleaning and fusion algorithm based on a time series similarity measure, which uses symbolic aggregation, the Euclidean algorithm, and similarity-weighted adjustment of similar sequences to complete cleaning; it uses a distributed Kalman filtering algorithm to complete data fusion. However, this cleaning algorithm requires a relatively large amount of computation and is unsuitable for distributed devices with limited computing power.
The second category completes data cleaning according to the spatial correlation of measurement data. Literature [10,11] proposed state estimation models for the power system based on a deep neural network and on a combined particle filter and convolutional neural network; these estimation models exploit the spatial correlation of measurement data. Literature [12] used the correlation between fault remote signaling data to group remote signaling displacement data, transforming the fault diagnosis problem of remote signaling displacement data into the classification of sample data in a multidimensional space. Literature [13] proposed a load data repair method based on a collaborative filtering recommendation algorithm, which calculates the load range recommendation degree according to the horizontal correlation of load changes within a distribution network area and achieves rapid correction of abnormal loads. Literature [14] proposed a system protection method for an integrated pipe gallery power cabin based on multisource heterogeneous data fusion. The method first integrates multiple distributed data sources from the pipe gallery power cabin using middleware technology for data-layer fusion. Then, data with similar characteristics are divided into subspaces, and the proposed LGP method is used to extract features in each subspace. Finally, the features extracted in each subspace are fused, fully exploiting the spatial correlation of multiple distributed data sources. In fact, using data correlation along only one dimension cannot sufficiently discriminate dirty data; the correlation of data should be measured jointly in time and space [15].
Literature [16] used a two-way comparison method to identify and process abnormal data in spatial power load forecasting. This method uses the load variation relative to the load at the previous moment as the criterion for judging whether the data are abnormal in the horizontal comparison and processes the abnormal data simultaneously. For multiyear data, after judging the abnormal values in the load data of each year at the same moment, the average value of the normal data is used as the correction value in the horizontal comparison. Literature [17] proposed a multilevel cleaning and identification method for measurement data based on the spatial–temporal correlation characteristics of the data: second-level data identification is carried out according to the time-series correlation of the measurement data, and a convolutional neural network is used to establish a spatial–physical correlation model for third-level data identification. However, when considering the spatial correlation of the measurement equipment, these studies on the spatiotemporal correlation of measurement data ignore the measurement deviation caused by the different spatial positions of the measurement equipment.
Toward this end, this paper proposes a joint spatiotemporal cleaning technique for distribution network measurement data based on CC-VC-UKF. In the presence of spatial measurement deviations, the FTU measurement sequence is filtered by the CC-VC-UKF algorithm with the TTU measurement data sequence as the reference sequence; FTU measurement data with higher accuracy are then obtained from the filtered data sequence and the measurement deviations according to the network topology.
The rest of this paper is organized as follows: the related work is discussed in Section 2; the proposed CC-VC-UKF method is presented in Section 3; combined spatiotemporal cleaning of measurement data in the distribution network based on the CC-VC-UKF algorithm is presented in Section 4; extensive simulations are conducted in Section 5; and finally, the conclusion is given in Section 6.
2. Related Work
Even if the electrical quantities of the same line are measured by multiple measuring devices, the results may exhibit measurement deviations with a non-zero mean and non-Gaussian distribution due to spatial differences between the devices. To solve this problem, the CC-VC-UKF algorithm is designed to filter and clean the measurement data of devices located at different positions along the same line section. Under the inherently existing spatial measurement deviation, some abnormal FTU data are removed through the data change trend of the TTU to ensure the quality of the FTU measurement data.
In this work, we use the flexible center position of the variable center (VC) correntropy and its match to non-zero-mean deviation distributions to reduce the computational requirements and overcome the measurement deviation of a non-zero mean and non-Gaussian distribution caused by the spatial inconsistency of measuring equipment. According to the characteristics of [18], a spatiotemporal joint cleaning technology for distribution network measurement data based on the correntropy criterion with variable center unscented Kalman filter (CC-VC-UKF) is proposed. The K-means algorithm is used to cluster and weight the historical data [19] to obtain the reference sequence for data error correction. Improving the accuracy of FTU measurement data in the presence of a non-zero-mean non-Gaussian measurement deviation provides a new idea for data cleaning in a distribution network.
4. Combined Spatiotemporal Cleaning of Measurement Data in the Distribution Network Based on the CC-VC-UKF Algorithm
Complex physical connections exist between the various nodes and lines in the power grid, with a strong correlation [23]. Therefore, the measurement data collected by the measurement devices also have temporal and spatial correlation, as shown in Figure 3.
The active daily load curve measured by the FTU over 30 days is obtained from field data, as shown in Figure 4. The trend of the measured data changes periodically in time: the time series shows that observation points separated by one period are similar, and the daily load curves of adjacent days are similar. At the same time, network topology constraints and power flow constraints hold between the various nodes of the power grid [24]. In addition, as shown in the schematic topology diagram of the power flow in Figure 5, the measurement data also have a spatial correlation. Therefore, this study introduces the variable center correntropy function into the unscented Kalman filter algorithm and performs filtering and cleaning based on the spatiotemporal correlation of the measurement data, which significantly improves the accuracy and quality of the final cleaned data in the presence of non-zero-mean non-Gaussian distributed spatial measurement deviations.
The specific process of the multilevel data filtering and cleaning technology based on space and time is shown in Figure 6. Throughout the process, all filtering steps adopt the CC-VC-UKF algorithm to deal with measurement deviations of a non-zero mean and non-Gaussian distribution. The measurement data of the equipment at different positions along the line are first temporally filtered according to their respective reference time series to improve data quality. After this per-device filtering, the filtered spatial measurement deviation and the secondarily filtered line-end measurement data can be obtained. Finally, according to the network topology relationship in the figure, the measurement data at the head end of the line are obtained after joint spatiotemporal filtering and cleaning.
4.1. Obtain a Reference Time Series
Prior to performing spatial filtering and fusion, the measurement data of the measurement equipment should be filtered according to the reference time series to reduce the volatility of the measurement data and improve its signal-to-noise ratio. The reference time series is obtained by classifying and weighting historical data with the K-means clustering algorithm, and a similarity measure can rapidly identify the measurement equipment whose data sequence has a relatively small spatial measurement deviation from the sequence to be cleaned. The K-means clustering algorithm and the similarity measure are described as follows:
4.1.1. K-Means Clustering
Although the daily load characteristic exhibits a certain periodicity, only the changing trend is periodic; the load difference between different days is still relatively large owing to the influence of various factors. Therefore, to obtain the reference sequence, the historical data should be clustered and divided into six categories; experimental verification shows that the errors caused by clustering can be ignored when the data are divided into six categories.
K-means clustering is performed on the historical data. The K-means algorithm flow is shown in Algorithm 1:

Algorithm 1: K-Means Algorithm Flow
Input: Historical time series
Output: Classification result
1. Select k initial samples as the initial cluster centers.
2. For each sample in the dataset, calculate its distance to the k cluster centers and assign it to the class of the nearest cluster center.
3. For each category, recalculate its cluster center.
4. Repeat steps 2 and 3 for 10 iterations.
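The steps of Algorithm 1 can be sketched in Python as follows. For reproducibility, this sketch replaces the random choice of initial samples in step 1 with a deterministic farthest-point initialization (an illustrative substitution, not the paper's procedure), and all curve values are made up:

```python
import math

def kmeans(curves, k, iters=10):
    """K-means over daily load curves, following Algorithm 1 above.
    Step 1's random sample selection is replaced by a deterministic
    farthest-point initialization for reproducibility."""
    # Step 1: pick k initial centers, each as far as possible from the rest.
    centers = [list(curves[0])]
    while len(centers) < k:
        far = max(curves, key=lambda c: min(math.dist(c, m) for m in centers))
        centers.append(list(far))
    for _ in range(iters):                              # step 4: repeat
        clusters = [[] for _ in range(k)]
        for c in curves:                                # step 2: assign
            d = [math.dist(c, m) for m in centers]
            clusters[d.index(min(d))].append(c)
        for j, cl in enumerate(clusters):               # step 3: recompute
            if cl:
                centers[j] = [sum(vals) / len(cl) for vals in zip(*cl)]
    labels = [min(range(k), key=lambda j: math.dist(c, centers[j]))
              for c in curves]
    return centers, labels

# Illustrative two-point "daily curves" forming two well-separated groups.
low = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1]]
high = [[9.0, 9.0], [9.1, 8.9]]
centers, labels = kmeans(low + high, k=2)
```

In practice, each curve would be a 288-point daily load series rather than a two-point toy vector.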
In the K-means algorithm, if the value of K is too small, that is, the number of classifications is too small, then each category will contain more data, and the load characteristics of different days cannot be distinguished effectively, thereby affecting the filtering results. If the value of K is too large, that is, the number of classifications is too large, then each category will contain very little data, thereby magnifying the influence of accidental measurement errors, which is also detrimental to the filtering result. Therefore, the value of K should be as large as possible while satisfying the accuracy requirements. The method for choosing K is shown in Algorithm 2:
Algorithm 2: Selection of K
Input: TTU history sequence
Output: Value of k
1. The TTU sequence of a certain day is used to filter the FTU sequence of the same day to obtain the filtered sequence FTU1, and the difference between the mean of FTU1 and the mean of FTU is calculated.
2. An offset Δ is added to the data in the TTU sequence to obtain TTU2, and TTU2 is used to filter the FTU to obtain the filtered sequence FTU2; then, the difference between the mean of FTU2 and the mean of FTU is calculated.
3. Δ is continuously adjusted until the difference obtained in step 2 deviates from that in step 1 beyond the accuracy requirement, and the corresponding Δ is recorded.
4. The TTU sequences of all days are averaged, and the difference between the maximum and minimum values gives the range R of the TTU sequence.
5. The number of categories is then obtained as k = R/Δ.
4.1.2. Similarity Measurement
The Minkowski distance is one of the most widely used measures in similarity measurement. It requires that the two sequences to be compared have the same length and that the points of the two time series correspond one-to-one [23], so that the Minkowski distance between the two sequences can be calculated rapidly as

d(A, C) = (Σ_{i=1}^{n} |a_i − c_i|^p)^{1/p},

where A and C are the two time series, and a_i and c_i are the i-th points of the A and C sequences, respectively. When p = 2, the Minkowski distance reduces to the Euclidean distance. The distance between the sequence to be cleaned and the reference sequences of the various classes is compared, and the one with the smallest distance is chosen as the reference sequence. In this study, the sequence closest to the time series to be cleaned is the TTU measurement time series.
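A minimal Python sketch of this selection step, with illustrative sequence values and hypothetical candidate names:

```python
def minkowski(a, c, p=2):
    """Minkowski distance between two equal-length time series;
    p = 2 gives the Euclidean distance used in this selection."""
    if len(a) != len(c):
        raise ValueError("sequences must have the same length")
    return sum(abs(x - y) ** p for x, y in zip(a, c)) ** (1.0 / p)

# Choose as reference the candidate sequence closest to the series to be
# cleaned (names and values here are purely illustrative).
to_clean = [1.0, 2.0, 3.0]
candidates = {"ttu": [1.1, 2.1, 2.9], "other": [4.0, 5.0, 6.0]}
reference = min(candidates, key=lambda k: minkowski(to_clean, candidates[k]))
```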
4.2. Space–Time Joint Cleaning Based on CC-VC-UKF Algorithm
4.2.1. Filtering Based on Reference Time Series
(1) The reference time series corresponding to the FTU and TTU measurement data are obtained by K-means clustering. Taking the FTU as an example, the state equation is fitted according to the reference time series. The specific fitting process is as follows:
First, the measurement value of the equipment is set as the ordinate. The measurement interval of the equipment is 5 min, giving 288 measurement time points in one day, and the unit of the abscissa is set to 1. Thus, a trigonometric function relationship between the measurement data values and the measurement time points can be fitted, as follows:

x_t = g(t),

where x_t represents the FTU measurement reference time series data value at time t, and g(·) is the fitted trigonometric function.
(2) After mathematical derivation of the trigonometric function relationship, the measured data relationship at two adjacent moments can be obtained as the fitted state equation, as follows:

x_t = f_{t−1}(x_{t−1}) + w_{t−1},

where x_t represents the FTU measurement reference time series data value at time t, f_{t−1} represents the conversion relationship between the variables at time t−1 and time t, and w_{t−1} represents the deviation between the FTU reference time series value at time t and the reference time series value predicted from time t−1. Then, the FTU and TTU measurement data are filtered based on the CC-VC-UKF algorithm.
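The exact trigonometric form used by the authors is not reproduced here. As an illustrative sketch, a single daily sinusoid with known period (288 five-minute samples) can be fitted by projection onto a sin/cos basis, and the fitted curve then supplies the one-step prediction that plays the role of f in a state equation of the form x_t = f(x_{t−1}) + w_{t−1}:

```python
import math

N = 288                      # 5-min samples in one day
OMEGA = 2 * math.pi / N      # one cycle per day

def fit_daily_sinusoid(x):
    """Fit x_t ~ c + a*sin(OMEGA*t) + b*cos(OMEGA*t) over one full day.
    With full-period sampling the sin/cos basis is orthogonal, so the
    coefficients are plain projections (no matrix solve needed)."""
    c = sum(x) / N
    a = 2.0 / N * sum(v * math.sin(OMEGA * t) for t, v in enumerate(x))
    b = 2.0 / N * sum(v * math.cos(OMEGA * t) for t, v in enumerate(x))
    return a, b, c

def predict_next(a, b, c, t):
    """One-step prediction: f evaluates the fitted curve at time t+1."""
    return c + a * math.sin(OMEGA * (t + 1)) + b * math.cos(OMEGA * (t + 1))

# A synthetic reference series shaped like a daily sinusoid is recovered
# exactly by the projection fit.
series = [10.0 + 3.0 * math.sin(OMEGA * t) for t in range(N)]
a, b, c = fit_daily_sinusoid(series)
```

On real load data, the residual between the observed value and `predict_next` corresponds to the process deviation w_{t−1} in the state equation.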
4.2.2. Cleaning Based on the Reference Sequence of TTU Measurement
When the FTU and TTU measurement data are filtered, the FTU measurement data can be cleaned according to the TTU measurement data as a reference sequence. Taking the measured active power as an example, the power flow relationship between FTU and TTU is shown in
Figure 7 in the measured data obtained from a city’s electric power company.
Therefore, the state equation and the measurement equation can be set as follows:

z_t = f(z_{t−1}) + w_{t−1},
y_t = z_t − v_t,

where z_t represents the TTU measurement data value at time t, f represents the conversion formula between the TTU measurement data values at time t−1 and time t, y_t represents the FTU measurement data value at time t, w_{t−1} represents the deviation of the TTU measurement data value between time t−1 and time t, and v_t represents the spatial measurement deviation at time t. The distribution histogram of v_t is shown in Figure 8, intuitively indicating that v_t follows a non-zero-mean non-Gaussian distribution. In addition, spatial measurement deviations are theoretically inherent, including line losses between measurement devices and measurement errors of the devices themselves. However, the CC-VC-UKF algorithm proposed in this paper is not designed to eliminate these deviations, but rather to remove some of the FTU anomalies through the trend of the TTU data by means of CC-VC-UKF-based filtering.
After filtering by the CC-VC-UKF algorithm, ẑ_t is set as the filtered TTU measurement data value, and v̂_t is the filtered spatial measurement deviation. Then, the FTU measurement data value after joint spatiotemporal cleaning can be expressed as ŷ_t = ẑ_t − v̂_t.
5. Numerical Results
5.1. Original Data Analysis
The data types and data volumes of different types of power grids vary greatly. For the distribution network, measurement data were obtained from a municipal bureau in a northern region; the sampling period of the measuring equipment is five minutes, so a daily load data series contains 288 points. The data include sets from various measurement equipment, such as FTUs and TTUs.
The communication of the measuring equipment in the distribution network is not completely reliable; thus, the erroneous data are mostly zero values. Figure 9 shows an example of one day of FTU measurement data in the distribution network, where the red and blue lines indicate the zero and non-zero values of the measurement data, respectively. The FTU data of the distribution network obtained here drop to zero every day from 1:00 to 4:00 in the morning.
After obtaining the raw FTU data, the zero values in the early morning must be preprocessed first. Because the missing values are numerous and concentrated, no particularly good interpolation algorithm is available. However, the load is relatively stable in the early morning and varies little; thus, nearest-value insertion can be used to repair the early-morning zero values.
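A minimal sketch of nearest-value repair for the early-morning zeros, with illustrative load values:

```python
def fill_zeros_nearest(values):
    """Replace zero (missing) samples with the nearest non-zero neighbour
    in time; suitable for the stable early-morning stretch where the load
    varies little."""
    out = list(values)
    nonzero = [i for i, v in enumerate(values) if v != 0]
    if not nonzero:
        return out
    for i in range(len(out)):
        if out[i] == 0:
            j = min(nonzero, key=lambda k: abs(k - i))  # closest valid sample
            out[i] = values[j]
    return out

# Zeros in the middle of the series are filled from the closest valid sample.
raw = [5.2, 5.1, 0, 0, 0, 4.9, 5.0]
filled = fill_zeros_nearest(raw)
```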
In the same way, the long-term field-measured data of a certain section of the distribution network can be preprocessed, and the FTU at point A and the other measurement equipment at other points yield active daily load curves for one month. Taking the FTU as an example, K-means clustering is performed on its active daily load curves, and the resulting classification of the active daily load curves is shown in Figure 10.
5.2. Experimental Results
The third type of load curve in Figure 10 is taken as an example; it contains the FTU-measured active daily load curves for the 12th to 19th, 21st, and 22nd days. The FTU active daily load curve of the 14th day is taken as the active power curve to be cleaned. Then, the average of the remaining active daily load curves is used as the reference time series. Finally, the active power curve to be cleaned is filtered based on the CC-VC-UKF algorithm. As shown in Figure 11, the random disturbance of the filtered measurement data is reduced, and the data quality is improved.
In the same way, the TTU measurement active power curve determined by the shortest Euclidean distance is filtered based on its reference time series, and the result is shown in Figure 12.
According to the spatiotemporal joint cleaning technique described in Section 4.2, the FTU measurement data sequence is filtered based on the CC-VC-UKF algorithm with the TTU measurement data as the reference data sequence. The spatial measurement deviations before and after filtering are compared in Figure 13.
Finally, after the joint space–time filtering and cleaning, the active power curve measured by the FTU is shown in Figure 14.
The spatial measurement deviation is calculated as v_t = z_t − y_t, where z_t represents the TTU measurement data value at time t, and y_t represents the FTU measurement data value at time t. At the same time, the power flow relationship between the FTU and TTU in Figure 7 shows that the spatial measurement deviation between them should theoretically be positive. However, the power system is attacked by dirty data to varying degrees under actual field conditions, which can cause large deviations in the measurement data; thus, the spatial measurement deviation may be negative. This study selects the measurement data of the FTU and TTU from 3:00 to 7:00 on the 30th day, during which the spatial measurement deviation is negative. The comparison of the spatial measurement deviation before and after spatiotemporal joint cleaning with the CC-VC-UKF algorithm is shown in Figure 15.
The figure shows that after filtering based on the CC-VC-UKF algorithm, the negative part of the spatial measurement deviation that falls outside the normal range is restored to a more reasonable value, and the volatility of the measurement deviation is reduced, showing that the proposed algorithm filters the spatial measurement deviation well under a dirty-data attack.
5.3. Experimental Result Verification
The original data sequences with no more than 1% vacancies are selected as the experimental set, and a randomly chosen proportion of the data is set to be empty. Then, the proposed method is used to clean the data sequence, and the values filled at the null positions are taken as the predicted values of the gaps after data cleaning. The gap predictions are compared with the original values, and the mean absolute percentage error (MAPE), root mean square error (RMSE) [25], mean absolute error (MAE) [11], and signal-to-noise ratio (SNR) are calculated. The SNR is calculated as

SNR = 10 lg( Σ_i (x_i − x̄)² / Σ_i (x_i − x̂_i)² ),

where x_i represents the original data value corresponding to the moment when the data value is set to be empty, x̂_i represents the predicted value of the vacancy, and x̄ represents the average of the original data values.
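The four indicators can be computed as follows. Note that the SNR form below (signal power as the spread of the original values around their mean, noise power as the squared prediction error) is one common convention and is assumed here; the y and yhat values are illustrative:

```python
import math

def mape(y, yhat):
    """Mean absolute percentage error (percent)."""
    return 100.0 / len(y) * sum(abs((a - b) / a) for a, b in zip(y, yhat))

def rmse(y, yhat):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def mae(y, yhat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def snr_db(y, yhat):
    """SNR in decibels: signal power as the spread of the original values
    around their mean, noise power as the squared prediction error
    (an assumed convention)."""
    mean = sum(y) / len(y)
    signal = sum((a - mean) ** 2 for a in y)
    noise = sum((a - b) ** 2 for a, b in zip(y, yhat))
    return 10.0 * math.log10(signal / noise)

# Illustrative original values (at the artificially emptied positions)
# and the corresponding gap predictions.
y = [10.0, 12.0, 11.0, 13.0]
yhat = [10.1, 11.9, 11.2, 12.8]
```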
This study compares data cleaning with the random forest algorithm [26], the LSTM algorithm [27], and the deep neural network algorithm [28]. It also applies the CC-VC criterion to the traditional extended Kalman filter (EKF). The comparative experimental results are shown in Table 1. In addition, to display the comparison more clearly, the corresponding radar chart of the data cleaning indicators is shown in Figure 16. The figure shows that, compared with the advanced algorithms published in the past two years, the proposed data cleaning algorithm still has certain advantages in improving data cleaning accuracy and data quality, which is helpful for the processing and storage of measurement data by subsequent power distribution terminals.
In addition, for measurement data sequences with different numbers of measurement time points, the Gaussian filtering algorithm, the wavelet threshold algorithm [5], and the algorithm in this study are used, and the computation time is recorded. To study the time complexity of the proposed algorithm, the experiments are completed on a computer with a 2.1 GHz CPU and 16 GB of memory. The results are shown in Table 2.
The table shows that when the number of measurement time points is 60, the CC-VC-UKF algorithm consumes about 95% more time than the wavelet threshold algorithm; when it is 576, the corresponding time consumption increases by only about 28%. This finding shows that, as the number of measurement time points increases, the time complexity gap between the CC-VC-UKF algorithm and the other algorithms gradually narrows. When the scale of the measurement data is sufficiently large, the difference in time complexity compared with the other algorithms can be ignored.
In addition, given that the proposed algorithm is aimed at offline data, it can satisfy the data cleaning requirements without considering the real-time performance.