1. Introduction
The issue of missing data in datasets has been a persistent challenge across various domains, significantly affecting the reliability of analyses and decision-making processes in smart grid applications. In the field of electricity, the deployment of smart meters has facilitated the collection of vast quantities of data, which are instrumental for billing, operational control, and maintenance of power distribution networks [1]. These datasets are not immune to missing data, primarily due to hardware failures, communication disruptions, and software glitches, which pose substantial obstacles to the accurate estimation of electricity consumption patterns and the efficient operation of the power grid [1].
Dealing with missing data in electrical consumption datasets is particularly challenging, especially when the sampling period is on the order of minutes or hours [1]. The difficulties stem from high dimensionality, temporal structure, and the presence of both short-term and long-term gaps. User consumption is strongly correlated with previous records; for instance, the consumption at 10 a.m. is closely related to the consumption at 9 a.m. This strong relationship offers advantages, as traditional statistical tools like interpolation can estimate consumption over short time intervals [1]. However, this approach does not account for macro variables, such as the type of day (working day or not) or external events, and it is inadequate for gaps spanning more than three records.
Moreover, user consumption patterns or daily consumption waveforms must sometimes also be considered. For example, in demand response studies like Time-of-Use pricing, it is necessary to consider user consumption patterns, since hourly rates are based on these profiles [2]. A holistic approach to data imputation in consumption profiles must account for the waveform, both for short-term gaps (within 3 h) and longer-term gaps (exceeding 4 h), including entire days.
There are at least three alternatives for handling missing data in a dataset: (1) delete incomplete records, (2) make it explicit that the data are missing, and (3) impute data. Deleting results in the loss of information about the available attributes of incomplete records, and making missing data explicit is only possible for categorical attributes [3]. The third alternative, imputation, replaces missing data with substitute values.
There is an extensive and rich literature on data imputation methods. A high-level introduction can be found in [4], and a more specialized explanatory text in [5]. As a general approach, depending on the relationship between the data values and the probability of missing data, it is necessary to distinguish between three cases: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR) [6]. In this paper, we focus on power consumption datasets and assume that missing data are completely random; in other words, we have no idea why the data are missing.
Among the imputation strategies, statistical imputation and machine learning-based methods, such as Linear Imputation, k-Nearest Neighbors Imputation, and Variational Autoencoders, have been explored with varying degrees of effectiveness [7,8]. These methods, however, often struggle to fully capture the complex dynamics and temporal dependencies inherent in electric consumption data. A comprehensive comparison of imputation methods can be found in [9], where the performance of 19 algorithms was tested on 15 real-world benchmark datasets. The algorithms are classified into three groups: statistical algorithms [1,7,10,11,12], machine learning algorithms [13,14,15,16,17,18,19], and deep learning algorithms [20,21,22].
However, power consumption datasets have some distinctive characteristics that make them difficult to complete using conventional data imputation algorithms. First of all, they are time series. Data imputation for time series requires a different approach, because each individual value is related to past and future values. If those relationships are known, they can be used to estimate the value of the missing data. For univariate time series, the simplest approach is to interpolate missing data, but this is only useful when the number of adjacent missing values is small. More sophisticated approaches are based on system identification techniques that search for error prediction models (e.g., ARIMA models) or state space models (e.g., the Kalman filter) [23]. For multivariate problems, there are also approaches that rely on inter-attribute correlations to estimate values for the missing data [24].
Second, power consumption time series usually exhibit multiple seasonalities. Since smart meters allow us to measure consumption every 15 min, they capture daily, weekly, monthly, and yearly consumption behaviors, which may be quasi-periodic. Most methods for imputing time series either do not account for seasonality or account for only one. As stated in [9], it has been speculated that some non-recurrent Neural Network architectures (convolutional and self-attentional) are able to model long-range dependencies in incomplete time series data, because they can connect distant values via shorter network paths.
Third, one of these seasonalities, the daily one, is very important for many electrical engineering applications. The description of daily consumption is known as the load profile, and it is critical for short-term demand forecasting, demand-side management programs, and time-of-use pricing programs, to name a few electrical engineering problems. We will refer to total profile imputation when all the data of a certain day are missing (the entire load profile is missing), and to partial profile imputation when only some data of that day are missing.
Fourth, it is not uncommon for power consumption datasets to have large sets of adjacent missing records. The data come from smart meters, and a failure in one meter can last hours or days, causing dozens of adjacent records to be missing. The entire load profiles of adjacent days may be missing. Unfortunately, most time series imputation methods are useful only when the number of adjacent missing values is small.
In summary, power consumption datasets require specialized data imputation methods. For example, in [25], an interpolation that takes the daily and weekly periodicity into account is presented. This method produces profiles that are continuous with respect to the adjacent available measurements, which is a highly desirable feature for power flow analyses.
In [1], a set of eight load profiles is extracted from the available data, and total imputation is performed by choosing the most feasible profile. The research highlights typical measurement errors, including short-term data shortages, defective data samples, and long-term data gaps. To tackle these issues, the authors use linear interpolation and other methods, such as spline functions and cubic interpolation, to estimate missing values. Additionally, they propose filters such as the median, Hampel, and Savitzky–Golay filters to reduce extreme disturbances in the data. The study models daily load profiles based on recorded measurements, which are then used to estimate missing data. Through a comparative analysis of various modeling approaches, the research concludes that the cyclic bi-Gaussian model offers a high match quality for predicting and filling in missing data, thus ensuring reliable load profile estimations and supporting efficient grid management.
In [26], the available load profiles are clustered and the resulting centroids are used to estimate the missing profile based on correlation distances. The study aggregated the daily load profiles of 100 domestic customers over a week and then clustered these profiles into several segments with the k-means algorithm. The authors tested various time windows to determine the optimal segmentation for minimizing estimation errors. The study applied four distance functions (Euclidean, Manhattan, Canberra, and Pearson correlation) to evaluate the accuracy of the load estimates. The simulation results indicated that the Canberra distance function provided the most accurate load estimates, with the smallest mean absolute percentage error (MAPE) and root-mean-square error (RMSE). The methodology demonstrated robustness in estimating both short-term (within 3 h) and long-term (exceeding 4 h) gaps in load data, offering a reliable tool for partial imputation.
In [27], denoising autoencoders are used for partial imputation. The autoencoders are trained with moving windows of data to encode the shape of short-term trends. The study demonstrated the effectiveness of denoising autoencoders by significantly reducing the root-mean-square error (RMSE) of imputation compared to traditional methods and other generative models, such as variational autoencoders and Wasserstein autoencoders. The method proved particularly beneficial for daily accumulated error metrics, reducing errors by up to 56% compared to other approaches, thus providing a highly accurate and reliable method for imputing missing values in electrical load data.
A different situation occurs when measurements from several meters are available. Missing data from one meter can be estimated using the information from the other meters. For example, in [28], the imputation problem is formulated as a spatiotemporal problem with high sparsity. The spatial dimension is the geographic position of the meter. The problem is addressed using a robust estimator based on principal components pursuit (PCP).
Another example is in [29], where an algorithm that utilizes power flow analysis to estimate missing values from smart meters is used. The methodology unfolds in three phases: detection, operation, and estimation. Initially, the system identifies a missing reading at any node. It then prompts adjacent nodes for additional data, such as voltage, current, and power factor, at a higher measurement resolution. Finally, using the data gathered from neighboring nodes and the power equations, the missing values are estimated through numerical approximations. The authors highlight that this method surpasses traditional regional averaging and data imputation techniques, particularly in scenarios involving extended consumption periods or periodic consumption, such as holidays, by maintaining higher accuracy and reliability in data estimation.
In this paper, we address the total and partial profile imputation problems. We assume that the available data can be enriched with external information. However, we do not consider the use of geographic or topological information. The available data may come from the measurements of a single user or a group of users.
Our methodology for imputing the missing data from smart meters splits the problem into two subproblems: estimating the shape, and estimating the total daily consumption. To model the shape, an autoencoder is used. Various methods, including statistical approaches and machine learning tools, are employed to estimate the shape and the total daily consumption. This methodology allows for the completion of missing data across specific time slots or entire days. The document is structured as follows: the proposed methodology is explained in Section 2 and tested in Section 3. Since shape encoding is quite important in our proposal, some remarks about it are made in Section 4. A brief note on the software implementation is included in Section 5, and the main conclusions and findings are given in Section 6.
2. Proposed Methodology
Before explaining our methodology, we emphasize that each load profile contains two important types of information: the amount of energy, and the time at which it is used. In other words, each load profile can be characterized by two geometrical properties: shape and area. The area is the total amount of energy used in the day and is quite simple to compute. The shape, on the other hand, refers to how the energy usage is distributed throughout the day.
Shape is important in many situations. Consider, for example, demand response programs. In order to build a portfolio of suggestions for users, it is more important to know the type of users being served than the amount of energy they consume. For example, two street lighting users may be interested in the same programs, regardless of the amount of energy they consume. In order to capture the shape of a load profile, we use a two-stage method: first, we perform a by-row normalization of the profile, and then we encode the shape by applying autoencoders to the normalized profiles. These two steps are explained in Section 2.1 and Section 2.2, respectively. We use the encoded shape and the daily energy consumption as the key variables to solve the imputation problem; the methodology to solve the total imputation problem is explained in Section 2.3. The partial imputation problem is addressed in Section 2.4.
2.1. By-Row Normalization
By-row normalization is a kind of normalization that preserves the shape [30]. Consider a matrix $M$ whose values are power measurements of a smart meter,

$$ M = [m_{ij}], \quad i = 1, \dots, d, \quad j = 1, \dots, m, \qquad (1) $$

where $m_{ij}$ is the measurement in day $i$ and slot $j$. The daily amount of energy is

$$ e_i = \sum_{j=1}^{m} m_{ij}. \qquad (2) $$

We define $F$, the matrix whose elements are the fraction of the daily amount of energy in a single slot; that is,

$$ f_{ij} = \frac{m_{ij}}{e_i}. \qquad (3) $$

Now, consider the minimum and maximum values of those fractions:

$$ f_{\min} = \min_{i,j} f_{ij}, \qquad f_{\max} = \max_{i,j} f_{ij}. \qquad (4) $$

A by-row normalization of the matrix $M$ is a standard scaler of $F$, with $f_{\min}$ and $f_{\max}$ as extreme values. Each element of the normalized matrix $N$ is calculated by

$$ n_{ij} = \frac{f_{ij} - f_{\min}}{f_{\max} - f_{\min}}. \qquad (5) $$

As a result, $n_{ij} \in [0, 1]$. Usually, $f_{\min} = 0$, in which case $n_{ij} = f_{ij}/f_{\max}$. By using the fraction of the daily energy instead of the absolute energy, the normalized matrix preserves the profile shapes.
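To make the normalization concrete, the following NumPy sketch implements Equations (1)–(5) and their inversion; the function names and the layout (days in rows, slots in columns) are our own illustrative choices, not code from the paper.

```python
import numpy as np

def by_row_normalize(M):
    """By-row normalization of a days-by-slots measurement matrix (Eqs. (1)-(5))."""
    M = np.asarray(M, dtype=float)
    e = M.sum(axis=1)                     # Eq. (2): daily energy of each row
    F = M / e[:, None]                    # Eq. (3): fraction of daily energy per slot
    f_min, f_max = F.min(), F.max()       # Eq. (4): global extremes of the fractions
    N = (F - f_min) / (f_max - f_min)     # Eq. (5): standard scaling of the fractions
    return e, N, f_min, f_max

def by_row_denormalize(N, e, f_min, f_max):
    """Invert Eqs. (1)-(5): recover measurements from shape and daily energy."""
    F = N * (f_max - f_min) + f_min
    return F * e[:, None]
```

The inverse function corresponds to the denormalization step used later in Section 2.3.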
2.2. Shape Encoding Using Autoencoders
Autoencoders are a type of Artificial Neural Network designed to encode a set of values. By encoding, we mean mapping the input values onto a smaller set of latent values. Since the latent dimension is usually lower than the input dimension, autoencoders can be seen as a tool to reduce dimensionality.
Figure 1a shows the structure of a simple autoencoder. The input and output layers have $m$ neurons, while the hidden layer has $l$ neurons. The network must be trained to act as the identity function, i.e., to calculate outputs equal to the inputs. If the training of the network is successful, the information of the input variables will also be present in the hidden layer, because the output layer is able to reproduce the input. In other words, the input information is encoded in the hidden layer. To access the encoded information, the network is divided into two networks, Encoder and Decoder, as shown in Figure 1b. It is common to refer to the encoded variables as latent or hidden variables, living in the latent or hidden space.
Figure 2 shows how to extract energy and shape information from a dataset, $M$. We use Equation (2) directly on the dataset to compute $E$, the matrix whose values are the daily energy consumptions. Then, we apply a by-row normalization, which gives us the matrix $N$. The next step is to train an autoencoder to encode $N$ into $S$, the matrix that will contain the shape information.
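The encoder/decoder split of Figure 1b can be sketched in a few lines of Keras; the layer sizes, the sigmoid activations, and the training settings below are illustrative assumptions rather than the configuration used in our experiments.

```python
from tensorflow import keras
from tensorflow.keras import layers

m = 96   # slots per day (illustrative; depends on the sampling period)
l = 1    # dimension of the latent (shape) space

inputs = keras.Input(shape=(m,))
latent = layers.Dense(l, activation="sigmoid", name="latent")(inputs)
outputs = layers.Dense(m, activation="sigmoid")(latent)

autoencoder = keras.Model(inputs, outputs)          # trained to act as the identity function
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(N, N, epochs=200, batch_size=32)  # N: normalized profiles (days x slots)

# Split the trained network into the two networks of Figure 1b.
encoder = keras.Model(inputs, latent)
decoder_input = keras.Input(shape=(l,))
decoder = keras.Model(decoder_input, autoencoder.layers[-1](decoder_input))

# S = encoder.predict(N)  # matrix of encoded shapes (latent variables)
```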
2.3. Total Profile Imputation
Data synthesis refers to the generation of non-real data based on existing data. Profile synthesis is the generation of non-real load profiles based on existing load profiles. In this sense, the total imputation problem can be seen as a case of the profile synthesis problem. Of course, the synthesized data must retain some characteristics of the existing data in order to be plausible. To generate a synthetic profile, we propose a “divide and conquer” strategy, splitting the problem into two subproblems: (a) synthesizing the shape, and (b) synthesizing the amount of energy.
As we will show in Section 3, each of the two synthesis subproblems can be addressed with any of the data imputation methods available in the literature. In addition, the explanatory variables of each subproblem may be different and context-dependent. Whatever methods and variables are used, they will produce synthetic values of energy and shape that allow us to construct synthetic counterparts of the energy and shape matrices.
Figure 3 shows how to use the synthetic energy and shape matrices to obtain a matrix of synthetic profiles. Two random variables are used to model additive noise on the synthetic energies and shapes. We decode the synthetic shapes into a matrix of normalized synthetic profiles, which is then denormalized using the synthetic energies to obtain the synthetic profiles. Denormalization is just the elementary process of inverting Equations (1)–(5).
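A minimal sketch of the synthesis step in Figure 3 is shown below, assuming the decoder and the normalization extremes from the previous sketches are available; the helper name, the Gaussian form of the noise terms, and the way the synthetic shape codes and energies are supplied are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def synthesize_profiles(S_hat, E_hat, decoder, f_min, f_max,
                        sigma_shape=0.0, sigma_energy=0.0, seed=None):
    """Build synthetic daily profiles from synthetic shape codes and daily energies.

    S_hat : (days, l) synthetic latent shapes produced by any imputation model
    E_hat : (days,)   synthetic daily energies
    """
    rng = np.random.default_rng(seed)
    S_noisy = S_hat + rng.normal(0.0, sigma_shape, size=S_hat.shape)   # additive shape noise
    E_noisy = E_hat + rng.normal(0.0, sigma_energy, size=E_hat.shape)  # additive energy noise

    N_hat = decoder.predict(S_noisy, verbose=0)   # normalized synthetic profiles
    F_hat = N_hat * (f_max - f_min) + f_min       # invert Eq. (5)
    return F_hat * E_noisy[:, None]               # invert Eqs. (2)-(3): scale by daily energy
```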
2.4. Partial Profile Imputation
If all of the measurements of a certain day are missing (a whole row of $M$ is missing), we impute a synthetic profile, generated as explained in Section 2.3. However, if some of the data for that day exist, we must merge the real information with the synthetic one.
Consider a single day. Let R and S be arrays of data that contain the real and synthetic profiles for that day. We must merge them to generate an array M. The proposal is: (a) keep the real data, (b) keep the shape of the synthetic profile for the slots with missing data, and (c) adjust the amount of energy of the synthetic profile using the energy information of the real data.
We define $r$ as the set of slots for which real data exist, and $s$ as the set of the remaining slots. If there are $m$ slots, then $|r| + |s| = m$. We use superindexes to represent the subset of data of an array; for example, $R^{r}$ is the subset of real data that exists.
To meet conditions (b) and (c), we calculate $e_R$ and $e_S$, the real and synthetic energy in the slots in which real data exist:

$$ e_R = \sum_{j \in r} R_j, \qquad e_S = \sum_{j \in r} S_j. $$

Then, we adjust the amount of energy in the missing data slots:

$$ M_j = \frac{e_R}{e_S}\, S_j, \quad j \in s. $$

The whole profile is just $M^{r} = R^{r}$ together with $M^{s}$ as defined above.
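For a single day, the merge described above can be written directly; the function name and the convention of marking missing slots with NaN are our own assumptions for this sketch.

```python
import numpy as np

def merge_partial(R, S):
    """Merge a real profile R (NaN in missing slots) with a synthetic profile S."""
    R, S = np.asarray(R, dtype=float), np.asarray(S, dtype=float)
    observed = ~np.isnan(R)            # set r: slots with real data
    e_real = R[observed].sum()         # real energy on the observed slots
    e_synth = S[observed].sum()        # synthetic energy on the same slots
    M = R.copy()                       # condition (a): keep the real data
    M[~observed] = (e_real / e_synth) * S[~observed]   # conditions (b) and (c)
    return M
```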
4. Some Remarks on Shape Encoding
As stated in Section 2, the methodology proposed in this paper is useful when it is important to model the shape of the energy consumption profile. For that reason, shape encoding is a critical step in the success of the model. Since encoding is a kind of dimensionality reduction strategy, it is logical to ask whether other dimensionality reduction alternatives would also be useful.
We tested the most popular dimensionality reduction strategy, Principal Component Analysis (PCA).
Figure 15 compares the synthetic shape profile obtained with autoencoders and PCA, using 1, 2, and 3 dimensions in the reduced space. Although the shapes are similar for a single dimension, as the number of dimensions increases, the synthetic profiles obtained by PCA become distorted. The distortion is so pronounced that some of the profiles obtained have negative values of consumption, which turns out to be a significant error.
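The comparison can be reproduced along the following lines with scikit-learn; the random stand-in for the normalized profile matrix is only there so the snippet runs on its own, and the check for negative values is our own illustration of the distortion mentioned above.

```python
import numpy as np
from sklearn.decomposition import PCA

# N would normally be the by-row-normalized profile matrix from Section 2.1;
# a random stand-in is used here only so the snippet runs on its own.
N = np.random.default_rng(0).random((365, 96))

for dims in (1, 2, 3):
    pca = PCA(n_components=dims)
    codes = pca.fit_transform(N)             # shapes in the reduced space
    recon = pca.inverse_transform(codes)     # profiles rebuilt from the reduced space
    # Unlike the bounded autoencoder output, the linear PCA reconstruction can
    # produce physically meaningless negative consumption values.
    print(dims, "negative values:", int((recon < 0).sum()))
```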
On the other hand, it should be noted that the randomness of the autoencoder training has an important effect on shape encoding. As the encoded shapes are internal variables of a Neural Network, the meaning of the encoded variables changes from one training process to another, even with the same training data set. In other words, the latent space changes. As a result, any interpretation based on the encoded variables is limited to the trained autoencoder that produced them. Consider, for example, Figure 5; in Section 3.1.1 we stated that ‘as it varies from 0 to 1, the shape changes from an almost flat profile to one that has a high value at noon’. However, for another trained autoencoder, the behavior may be the opposite.
The randomness of autoencoder training has another drawback. Remember that the autoencoder must be trained to act as an identity function, but this is not always achieved, so verification is always needed. In all of our experiments, when we need an autoencoder, we train 10 candidates and use the best one. Each of the 10 is tested to make sure it preserves the variability of the shapes.
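One possible reading of this selection step is sketched below; build_and_train is a hypothetical helper that trains one autoencoder and returns the model, its encoder, and a validation loss, and the variability check on the latent variables is our interpretation of the verification mentioned above, not the paper's exact code.

```python
import numpy as np

def select_autoencoder(build_and_train, N, n_candidates=10, min_latent_std=1e-3):
    """Train several autoencoders on N and keep the best verified candidate."""
    best, best_loss = None, np.inf
    for _ in range(n_candidates):
        model, encoder, loss = build_and_train(N)      # hypothetical training helper
        latent = encoder.predict(N, verbose=0)
        if latent.std(axis=0).min() < min_latent_std:  # collapsed latent: shape variability lost
            continue
        if loss < best_loss:
            best, best_loss = (model, encoder), loss
    return best
```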
Finally, we would like to point out that shape encoding has other potential applications. For example, the detection of abnormal consumption is a concern for many companies. It is possible to detect some cases using outlier identification techniques in the latent space.
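As an illustration of this idea, a standard outlier detector can be run directly on the encoded shapes; IsolationForest and the contamination level are just one possible choice, not something prescribed by this paper, and the random stand-in for the latent matrix is only there so the snippet runs on its own.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# S would normally be encoder.predict(N); a random stand-in is used here only
# so the snippet runs on its own.
S = np.random.default_rng(0).random((365, 1))

detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(S)              # -1 marks days flagged as abnormal
abnormal_days = np.flatnonzero(labels == -1)  # indices of suspicious load profiles
```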
6. Conclusions
Data imputation is a context-sensitive problem; there is no silver bullet to solve it. The intended application of the dataset is perhaps the most important issue to consider, because in some applications the load profile is critical, while in others it is not. For those applications where the load profile is important, the shape itself can be considered a relevant feature, and data imputation must try to preserve its statistical properties.
In this paper, we have shown that autoencoders are suitable for quantifying the geometric concept of the shape of load profiles. Once in a numerical space, the shape imputation can be treated as a conventional data imputation problem. Since data encoding can be viewed as a dimensionality reduction process, we have also shown that PCA, the most popular dimensionality reduction tool, does not preserve the shape of the original data.
For the dataset used in the examples, we found that it was a good idea to perform the data imputation based on ad hoc profile synthesis. We synthesize the shape and the daily energy based on the type of day and the day of the year in separate processes, and then combine them to create the synthetic load profile. A simple linear regression over auxiliary variables was sufficient, and even outperformed Neural Networks.
Two quantitative results must be emphasized: first, the average error is reduced by a significant factor compared to data imputation based on the average of the available data; second, the standard deviation of the error in the best cases is low, as can be seen in Table 1 and Table 3. A low standard deviation indicates a high degree of reliability in the selected parameter.
Although the results are satisfactory, we do not claim that this procedure will be useful for every energy consumption dataset. On the contrary, we emphasize that the characteristics of the data sets must be taken into account in order to design a good data imputation mechanism.
In our tests, the latent space of the autoencoder performs well with a single dimension; adding more dimensions does not yield significant benefits. It should be clarified that the appropriate dimensionality of the latent space may be influenced by the data set used. The latent space has potential applications, such as detecting abnormal consumption patterns of interest to different utility companies and identifying outliers. Outlier detection techniques in the latent space can successfully identify some anomalous cases.