1. Introduction
The issue of missing data in datasets has been a persistent challenge across various domains, significantly affecting the reliability of analyses and decision-making processes in smart grid applications. In the field of electricity, the deployment of smart meters has facilitated the collection of vast quantities of data, which are instrumental for billing, operational control, and maintenance of power distribution networks [1]. These datasets are not immune to missing data, primarily due to hardware failures, communication disruptions, and software glitches, which pose substantial obstacles to the accurate estimation of electricity consumption patterns and the efficient operation of the power grid [1].
Dealing with missing data in electrical consumption datasets is particularly challenging, especially when the sampling period is on the order of minutes or hours [1]. The difficulties stem from high dimensionality, temporal structure, and the presence of both short-term and long-term gaps. User consumption is strongly correlated with previous records; for instance, the consumption at 10 a.m. is closely related to the consumption at 9 a.m. This strong relationship offers advantages, as traditional statistical tools like interpolation can estimate consumption over short time intervals [1]. However, this approach does not account for macro variables, such as the type of day (working day or not) or external events, and it is inadequate for gaps spanning more than three records.
Moreover, user consumption patterns or daily consumption waveforms must sometimes also be considered. For example, in demand response studies like Time-of-Use pricing, it is necessary to consider user consumption patterns, since hourly rates are based on these profiles [2]. A holistic approach to data imputation in consumption profiles must account for the waveform, both for short-term gaps (within 3 h) and longer-term gaps (exceeding 4 h), including entire days.
There are at least three alternatives for handling missing data in a dataset: (1) delete incomplete records, (2) make it explicit that the data are missing, and (3) impute data. Deleting results in the loss of information about the available attributes of incomplete records, and making missing data explicit is only possible for categorical attributes [3]. The third alternative, imputation, replaces missing data with substitute values.
There is an extensive and rich literature on data imputation methods. A high-level introduction can be found in [4], and a more specialized explanatory text in [5]. As a general approach, depending on the relationship between the data values and the probability of missing data, it is necessary to distinguish between three cases: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR) [6]. In this paper, we focus on power consumption datasets and assume that missing data are completely random; in other words, we have no idea why the data are missing.
Among the imputation strategies, statistical imputation and machine learning-based methods, such as Linear Imputation, k-Nearest Neighbors Imputation, and Variational Autoencoders, have been explored with varying degrees of effectiveness [7,8]. These methods, however, often struggle to fully capture the complex dynamics and temporal dependencies inherent in electric consumption data. A comprehensive comparison of imputation methods can be found in [9], where the performance of 19 algorithms was tested on 15 real-world benchmark datasets. The algorithms are classified into three groups: statistical algorithms [1,7,10,11,12], machine learning algorithms [13,14,15,16,17,18,19], and deep learning algorithms [20,21,22].
However, power consumption datasets have some distinctive characteristics that make them difficult to complete using conventional data imputation algorithms. First of all, they are time series. Data imputation for time series requires a different approach, because each individual value is related to past and future values. If those relationships are known, they can be used to estimate the value of the missing data. For univariate time series, the simplest approach is to interpolate missing data, but this is only useful when the number of adjacent missing values is small. More sophisticated approaches are based on system identification techniques that search for error prediction models (e.g., ARIMA models) or state space models (e.g., the Kalman filter) [23]. For multivariate problems, there are also approaches that rely on inter-attribute correlations to estimate values for the missing data [24].
Second, power consumption time series usually exhibit multiple seasonalities. Since smart meters allow us to measure consumption every 15 min, they capture daily, weekly, monthly, and yearly consumption behaviors, which may be quasi-periodic. Most methods for imputing time series either do not account for seasonality or account for only one. As stated in [9], it has been speculated that some non-recurrent Neural Network architectures (convolutional and self-attentional) are able to model long-range dependencies in incomplete time series data, because they can connect distant values via shorter network paths.
Third, one of these seasonalities, the daily one, is very important for many electrical engineering applications. The description of daily consumption is known as the load profile, and it is critical for short-term demand forecasting, demand-side management programs, and time-of-use pricing programs, to name a few electrical engineering problems. We will refer to total profile imputation when all the data of a certain day are missing (the entire load profile is missing), and to partial profile imputation when only some data of that day are missing.
Fourth, it is not uncommon for power consumption datasets to have large sets of adjacent missing records. The data come from smart meters, and a failure in one meter can last hours or days, causing dozens of adjacent records to be missing. The entire load profiles of adjacent days may be missing. Unfortunately, most time series imputation methods are useful only when the number of adjacent missing values is small.
In summary, power consumption datasets require specialized data imputation methods. For example, in [25], an interpolation that takes the daily and weekly periodicity into account is presented. This method produces profiles that are continuous with respect to the adjacent available measurements, which is a highly desirable feature for power flow analyses.
In [1], a set of eight load profiles is extracted from the available data, and total imputation is performed by choosing the most feasible profile. The research highlights typical measurement errors, including short-term data shortages, defective data samples, and long-term data gaps. To tackle these issues, the authors use linear interpolation and other methods, such as spline functions and cubic interpolation, to estimate missing values. Additionally, they propose filters such as the median, Hampel, and Savitzky–Golay filters to reduce extreme disturbances in the data. The study models daily load profiles based on recorded measurements, which are then used to estimate missing data. Through a comparative analysis of various modeling approaches, the research concludes that the cyclic bi-Gaussian model offers a high match quality for predicting and filling in missing data, thus ensuring reliable load profile estimations and supporting efficient grid management.
In [26], the available load profiles are clustered and the resulting centroids are used to estimate the missing profile based on correlation distances. The study aggregated the daily load profiles of 100 domestic customers over a week and then clustered these profiles into several segments with the k-means algorithm. The authors tested various time windows to determine the optimal segmentation for minimizing estimation errors. The study applied four distance functions (Euclidean, Manhattan, Canberra, and Pearson correlation) to evaluate the accuracy of the load estimates. The simulation results indicated that the Canberra distance function provided the most accurate load estimates, with the smallest mean absolute percentage error (MAPE) and root-mean-square error (RMSE). The methodology demonstrated robustness in estimating both short-term (within 3 h) and long-term (exceeding 4 h) gaps in load data, offering a reliable tool for partial imputation.
In [27], denoising autoencoders are used for partial imputation. The autoencoders are trained with moving windows of data to encode the shape of short-term trends. The study demonstrated the effectiveness of denoising autoencoders by significantly reducing the root-mean-square error (RMSE) of imputation compared to traditional methods and other generative models, such as variational autoencoders and Wasserstein autoencoders. The method proved particularly beneficial for daily accumulated error metrics, reducing errors by up to 56% compared to other approaches, thus providing a highly accurate and reliable method for imputing missing values in electrical load data.
A different situation occurs when measurements from several meters are available. Missing data from one meter can be estimated using the information from the other meters. For example, in [28], the imputation problem is formulated as a spatiotemporal problem with high sparsity. The spatial dimension is the geographic position of the meter. The problem is addressed using a robust estimator based on principal components pursuit (PCP).
Another example is in [29], where an algorithm that utilizes power flow analysis to estimate missing values from smart meters is used. The methodology unfolds in three phases: detection, operation, and estimation. Initially, the system identifies a missing reading at any node. It then prompts adjacent nodes for additional data, such as voltage, current, and power factor, at a higher measurement resolution. Finally, using the data gathered from neighboring nodes and the power equations, the missing values are estimated through numerical approximations. The authors highlight that this method surpasses traditional regional averaging and data imputation techniques, particularly in scenarios involving extended consumption periods or periodic consumption, such as holidays, by maintaining higher accuracy and reliability in data estimation.
In this paper, we address the total and partial profile imputation problems. We assume that the available data can be enriched with external information. However, we do not consider the use of geographic or topological information. The available data may come from the measurements of a single user or a group of users.
Our methodology for imputing the missing data from smart meters splits the problem into two subproblems: estimating the shape, and estimating the total daily consumption. To model the shape, an autoencoder is used. Various methods, including statistical approaches and machine learning tools, are employed to estimate the shape and the total daily consumption. This methodology allows for the completion of missing data across specific time slots or entire days. The document is structured as follows: the proposed methodology is explained in Section 2 and tested in Section 3. Since shape encoding is quite important in our proposal, some remarks about it are made in Section 4. A brief note on the software implementation is included in Section 5, and the main conclusions and findings are given in Section 6.
2. Proposed Methodology
Before explaining our methodology, we emphasize that each load profile contains two important types of information: the amount of energy, and the time at which it is used. In other words, each load profile can be characterized by two geometrical properties: shape and area. The area is the total amount of energy used in the day and is quite simple to compute. The shape, on the other hand, refers to how the energy usage is distributed throughout the day.
Shape is important in many situations. Consider, for example, demand response programs. In order to build a portfolio of suggestions for users, it is more important to know the type of users being served than the amount of energy they consume. For example, two street lighting users may be interested in the same programs, regardless of the amount of energy they consume. In order to capture the shape of a load profile, we use a two-stage method: first, we perform a by-row normalization of the profile, and then we encode the shape by applying autoencoders to the normalized profiles. These two steps are explained in Section 2.1 and Section 2.2, respectively. We use the encoded shape and the daily energy consumption as the key variables to solve the imputation problem; the methodology to solve the total imputation problem is explained in Section 2.3. The partial imputation problem is addressed in Section 2.4.
2.1. By-Row Normalization
By-row normalization is a kind of normalization that preserves the shape [30]. Consider a matrix $M$ whose values are power measurements of a smart meter,

$$ M = [m_{ij}], \quad i = 1, \dots, d, \quad j = 1, \dots, m, \qquad (1) $$

where $m_{ij}$ is the measurement in day $i$ and slot $j$. The daily amount of energy is

$$ e_i = \sum_{j=1}^{m} m_{ij}. \qquad (2) $$

We define $F$, the matrix whose elements are the fraction of the daily amount of energy in a single slot; that is,

$$ f_{ij} = \frac{m_{ij}}{e_i}. \qquad (3) $$

Now, consider the minimum and maximum values of those fractions:

$$ f_{\min} = \min_{i,j} f_{ij}, \qquad f_{\max} = \max_{i,j} f_{ij}. \qquad (4) $$

A by-row normalization of the matrix $M$ is a standard scaler of $F$, with $f_{\min}$ and $f_{\max}$ as extreme values. Each element of the normalized matrix $N$ is calculated by

$$ n_{ij} = \frac{f_{ij} - f_{\min}}{f_{\max} - f_{\min}}. \qquad (5) $$

As a result, $n_{ij} \in [0, 1]$. Usually, $f_{\min} = 0$, in which case $n_{ij} = f_{ij}/f_{\max}$. By using the fraction of the daily energy instead of the absolute energy, the normalized matrix preserves the profile shapes.
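To make the normalization concrete, the following NumPy sketch implements Equations (1)–(5) and their inversion; the function names and the layout (days in rows, slots in columns) are our own illustrative choices, not code from the paper.

```python
import numpy as np

def by_row_normalize(M):
    """By-row normalization of a days-by-slots measurement matrix (Eqs. (1)-(5))."""
    M = np.asarray(M, dtype=float)
    e = M.sum(axis=1)                     # Eq. (2): daily energy of each row
    F = M / e[:, None]                    # Eq. (3): fraction of daily energy per slot
    f_min, f_max = F.min(), F.max()       # Eq. (4): global extremes of the fractions
    N = (F - f_min) / (f_max - f_min)     # Eq. (5): standard scaling of the fractions
    return e, N, f_min, f_max

def by_row_denormalize(N, e, f_min, f_max):
    """Invert Eqs. (1)-(5): recover measurements from shape and daily energy."""
    F = N * (f_max - f_min) + f_min
    return F * e[:, None]
```

The inverse function corresponds to the denormalization step used later in Section 2.3.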
2.2. Shape Encoding Using Autoencoders
Autoencoders are a type of Artificial Neural Network designed to encode a set of values. By encoding, we mean mapping the input values onto a smaller set of latent values. Since the latent dimension is usually lower than the input dimension, autoencoders can be seen as a tool to reduce dimensionality.
Figure 1a shows the structure of a simple autoencoder. The input and output layers have $m$ neurons, while the hidden layer has $l$ neurons. The network must be trained to act as the identity function, i.e., to calculate outputs equal to the inputs. If the training of the network is successful, the information of the input variables will also be present in the hidden layer, because the output layer is able to reproduce the input. In other words, the input information is encoded in the hidden layer. To access the encoded information, the network is divided into two networks, Encoder and Decoder, as shown in Figure 1b. It is common to refer to the encoded variables as latent or hidden variables, living in the latent or hidden space.
Figure 2 shows how to extract energy and shape information from a dataset, $M$. We use Equation (2) directly on the dataset to compute $E$, the matrix whose values are the daily energy consumptions. Then, we apply a by-row normalization, which gives us the matrix $N$. The next step is to train an autoencoder to encode $N$ into $S$, the matrix that will contain the shape information.
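The encoder/decoder split of Figure 1b can be sketched in a few lines of Keras; the layer sizes, the sigmoid activations, and the training settings below are illustrative assumptions rather than the configuration used in our experiments.

```python
from tensorflow import keras
from tensorflow.keras import layers

m = 96   # slots per day (illustrative; depends on the sampling period)
l = 1    # dimension of the latent (shape) space

inputs = keras.Input(shape=(m,))
latent = layers.Dense(l, activation="sigmoid", name="latent")(inputs)
outputs = layers.Dense(m, activation="sigmoid")(latent)

autoencoder = keras.Model(inputs, outputs)          # trained to act as the identity function
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(N, N, epochs=200, batch_size=32)  # N: normalized profiles (days x slots)

# Split the trained network into the two networks of Figure 1b.
encoder = keras.Model(inputs, latent)
decoder_input = keras.Input(shape=(l,))
decoder = keras.Model(decoder_input, autoencoder.layers[-1](decoder_input))

# S = encoder.predict(N)  # matrix of encoded shapes (latent variables)
```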
2.3. Total Profile Imputation
Data synthesis refers to the generation of non-real data based on existing data. Profile synthesis is the generation of non-real load profiles based on existing load profiles. In this sense, the total imputation problem can be seen as a case of the profile synthesis problem. Of course, the synthesized data must retain some characteristics of the existing data in order to be plausible. To generate a synthetic profile, we propose a “divide and conquer” strategy, splitting the problem into two subproblems: (a) synthesizing the shape, and (b) synthesizing the amount of energy.
As we will show in Section 3, each of the two synthesis subproblems can be addressed with any of the data imputation methods available in the literature. In addition, the explanatory variables of each subproblem may be different and context-dependent. Whatever methods and variables are used, they will produce synthetic values of energy and shape that allow us to construct synthetic counterparts of the energy and shape matrices.
Figure 3 shows how to use the synthetic energy and shape matrices to obtain a matrix of synthetic profiles. Two random variables are used to model additive noise on the synthetic energies and shapes. We decode the synthetic shapes into a matrix of normalized synthetic profiles, which is then denormalized using the synthetic energies to obtain the synthetic profiles. Denormalization is just the elementary process of inverting Equations (1)–(5).
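A minimal sketch of the synthesis step in Figure 3 is shown below, assuming the decoder and the normalization extremes from the previous sketches are available; the helper name, the Gaussian form of the noise terms, and the way the synthetic shape codes and energies are supplied are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def synthesize_profiles(S_hat, E_hat, decoder, f_min, f_max,
                        sigma_shape=0.0, sigma_energy=0.0, seed=None):
    """Build synthetic daily profiles from synthetic shape codes and daily energies.

    S_hat : (days, l) synthetic latent shapes produced by any imputation model
    E_hat : (days,)   synthetic daily energies
    """
    rng = np.random.default_rng(seed)
    S_noisy = S_hat + rng.normal(0.0, sigma_shape, size=S_hat.shape)   # additive shape noise
    E_noisy = E_hat + rng.normal(0.0, sigma_energy, size=E_hat.shape)  # additive energy noise

    N_hat = decoder.predict(S_noisy, verbose=0)   # normalized synthetic profiles
    F_hat = N_hat * (f_max - f_min) + f_min       # invert Eq. (5)
    return F_hat * E_noisy[:, None]               # invert Eqs. (2)-(3): scale by daily energy
```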
2.4. Partial Profile Imputation
If all of the measurements of a certain day are missing (a whole row of $M$ is missing), we impute a synthetic profile, generated as explained in Section 2.3. However, if some of the data for that day exist, we must merge the real information with the synthetic one.
Consider a single day. Let R and S be arrays of data that contain the real and synthetic profiles for that day. We must merge them to generate an array M. The proposal is: (a) keep the real data, (b) keep the shape of the synthetic profile for the slots with missing data, and (c) adjust the amount of energy of the synthetic profile using the energy information of the real data.
We define $r$ as the set of slots for which real data exist, and $s$ as the set of the remaining slots. If there are $m$ slots, then $|r| + |s| = m$. We use superindexes to represent the subset of data of an array; for example, $R^{r}$ is the subset of real data that exists.
To meet conditions (b) and (c), we calculate $e_R$ and $e_S$, the real and synthetic energy in the slots in which real data exist:

$$ e_R = \sum_{j \in r} R_j, \qquad e_S = \sum_{j \in r} S_j. $$

Then, we adjust the amount of energy in the missing data slots:

$$ M_j = \frac{e_R}{e_S}\, S_j, \quad j \in s. $$

The whole profile is just $M^{r} = R^{r}$ together with $M^{s}$ as defined above.
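For a single day, the merge described above can be written directly; the function name and the convention of marking missing slots with NaN are our own assumptions for this sketch.

```python
import numpy as np

def merge_partial(R, S):
    """Merge a real profile R (NaN in missing slots) with a synthetic profile S."""
    R, S = np.asarray(R, dtype=float), np.asarray(S, dtype=float)
    observed = ~np.isnan(R)            # set r: slots with real data
    e_real = R[observed].sum()         # real energy on the observed slots
    e_synth = S[observed].sum()        # synthetic energy on the same slots
    M = R.copy()                       # condition (a): keep the real data
    M[~observed] = (e_real / e_synth) * S[~observed]   # conditions (b) and (c)
    return M
```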
4. Some Remarks on Shape Encoding
As stated in Section 2, the methodology proposed in this paper is useful when it is important to model the shape of the energy consumption profile. For that reason, shape encoding is a critical step in the success of the model. Since encoding is a kind of dimensionality reduction strategy, it is logical to ask whether other dimensionality reduction alternatives would also be useful.
We tested the most popular dimensionality reduction strategy, Principal Component Analysis (PCA).
Figure 15 compares the synthetic shape profile obtained with autoencoders and PCA, using 1, 2, and 3 dimensions in the reduced space. Although the shapes are similar for a single dimension, as the number of dimensions increases, the synthetic profiles obtained by PCA become distorted. The distortion is so pronounced that some of the profiles obtained have negative values of consumption, which turns out to be a significant error.
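The comparison can be reproduced along the following lines with scikit-learn; the random stand-in for the normalized profile matrix is only there so the snippet runs on its own, and the check for negative values is our own illustration of the distortion mentioned above.

```python
import numpy as np
from sklearn.decomposition import PCA

# N would normally be the by-row-normalized profile matrix from Section 2.1;
# a random stand-in is used here only so the snippet runs on its own.
N = np.random.default_rng(0).random((365, 96))

for dims in (1, 2, 3):
    pca = PCA(n_components=dims)
    codes = pca.fit_transform(N)             # shapes in the reduced space
    recon = pca.inverse_transform(codes)     # profiles rebuilt from the reduced space
    # Unlike the bounded autoencoder output, the linear PCA reconstruction can
    # produce physically meaningless negative consumption values.
    print(dims, "negative values:", int((recon < 0).sum()))
```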
On the other hand, it should be noted that the randomness of the autoencoder training has an important effect on shape encoding. As the encoded shapes are internal variables of a Neural Network, the meaning of the encoded variables changes from one training process to another, even with the same training data set. In other words, the latent space changes. As a result, any interpretation based on the encoded variables is limited to the trained autoencoder that produced them. Consider, for example, Figure 5; in Section 3.1.1 we stated that ‘as it varies from 0 to 1, the shape changes from an almost flat profile to one that has a high value at noon’. However, for another trained autoencoder, the behavior may be the opposite.
The randomness of autoencoder training has another drawback. Remember that the autoencoder must be trained to act as an identity function, but this is not always achieved, so verification is always needed. In all of our experiments, when we need an autoencoder, we train 10 candidates and use the best one. Each of the 10 is tested to make sure it preserves the variability of the shapes.
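One possible reading of this selection step is sketched below; build_and_train is a hypothetical helper that trains one autoencoder and returns the model, its encoder, and a validation loss, and the variability check on the latent variables is our interpretation of the verification mentioned above, not the paper's exact code.

```python
import numpy as np

def select_autoencoder(build_and_train, N, n_candidates=10, min_latent_std=1e-3):
    """Train several autoencoders on N and keep the best verified candidate."""
    best, best_loss = None, np.inf
    for _ in range(n_candidates):
        model, encoder, loss = build_and_train(N)      # hypothetical training helper
        latent = encoder.predict(N, verbose=0)
        if latent.std(axis=0).min() < min_latent_std:  # collapsed latent: shape variability lost
            continue
        if loss < best_loss:
            best, best_loss = (model, encoder), loss
    return best
```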
Finally, we would like to point out that shape encoding has other potential applications. For example, the detection of abnormal consumption is a concern for many companies. It is possible to detect some cases using outlier identification techniques in the latent space.
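As an illustration of this idea, a standard outlier detector can be run directly on the encoded shapes; IsolationForest and the contamination level are just one possible choice, not something prescribed by this paper, and the random stand-in for the latent matrix is only there so the snippet runs on its own.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# S would normally be encoder.predict(N); a random stand-in is used here only
# so the snippet runs on its own.
S = np.random.default_rng(0).random((365, 1))

detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(S)              # -1 marks days flagged as abnormal
abnormal_days = np.flatnonzero(labels == -1)  # indices of suspicious load profiles
```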
6. Conclusions
Data imputation is a context-sensitive problem; there is no silver bullet to solve it. The intended application of the dataset is perhaps the most important issue to consider, because in some applications the load profile is critical, while in others it is not. For those applications where the load profile is important, the shape itself can be considered a relevant feature, and data imputation must try to preserve its statistical properties.
In this paper, we have shown that autoencoders are suitable for quantifying the geometric concept of the shape of load profiles. Once in a numerical space, the shape imputation can be treated as a conventional data imputation problem. Since data encoding can be viewed as a dimensionality reduction process, we have also shown that PCA, the most popular dimensionality reduction tool, does not preserve the shape of the original data.
For the dataset used in the examples, we found that it was a good idea to perform the data imputation based on ad hoc profile synthesis. We synthesize the shape and the daily energy based on the type of day and the day of the year in separate processes, and then combine them to create the synthetic load profile. A simple linear regression over auxiliary variables was sufficient, and even outperformed Neural Networks.
Two quantitative results must be emphasized: first, the average error is reduced by a significant factor compared to data imputation based on the average of the available data; second, the standard deviation of the error in the best cases is low, as can be seen in Table 1 and Table 3. A low standard deviation indicates a high degree of reliability in the selected parameter.
Although the results are satisfactory, we do not claim that this procedure will be useful for every energy consumption dataset. On the contrary, we emphasize that the characteristics of the data sets must be taken into account in order to design a good data imputation mechanism.
In our tests, the latent space of the autoencoder performs well with a single dimension; adding more dimensions does not yield significant benefits. It should be clarified that the appropriate dimensionality of the latent space may be influenced by the data set used. The latent space has potential applications, such as detecting abnormal consumption patterns of interest to different utility companies and identifying outliers. Outlier detection techniques in the latent space can successfully identify some anomalous cases.