**Deep Learning-Based Short-Term Load Forecasting for Supporting Demand Response Program in Hybrid Energy System**

#### **Sholeh Hadi Pramono \*, Mahdin Rohmatillah \*, Eka Maulana, Rini Nur Hasanah and Fakhriy Hario**

Department of Electrical Engineering, University of Brawijaya, Veteran Road, Lowokwaru, Malang 65113, Indonesia

**\*** Correspondence: sholehpramono@ub.ac.id (S.H.P.); rohmatillahmahdin1994@gmail.com (M.R.); Tel.: +62341-554166 (S.H.P.)

Received: 8 August 2019; Accepted: 28 August 2019; Published: 30 August 2019

**Abstract:** A novel method for short-term load forecasting (STLF) is proposed in this paper. The method utilizes both long and short data sequences, which are fed to a wavenet-based model employing a dilated causal residual convolutional neural network (CNN) and a long short-term memory (LSTM) layer, respectively, to forecast future load demand on an hourly basis. The model is aimed at supporting the demand response program in hybrid energy systems, especially systems using renewable and fossil sources. In order to prove the generality of our model, two different datasets are used: the ENTSO-E (European Network of Transmission System Operators for Electricity) dataset and the ISO-NE (Independent System Operator New England) dataset. Moreover, two different ways of model testing are conducted. The first is testing with a dataset having an identical distribution to the validation data, while the second is testing with data having an unknown distribution. The results show that our proposed model outperforms other deep learning-based models in terms of root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). In detail, our model achieves RMSE, MAE, and MAPE equal to 203.23, 142.23, and 2.02 for ENTSO-E testing dataset 1 and 292.07, 196.95, and 3.1 for ENTSO-E dataset 2. Meanwhile, on the ISO-NE dataset, the RMSE, MAE, and MAPE are equal to 85.12, 58.96, and 0.4 for ISO-NE testing dataset 1 and 85.31, 62.23, and 0.46 for ISO-NE dataset 2.

**Keywords:** short-term load forecasting; deep learning; wavenet; long short-term memory; demand response; hybrid energy system

#### **1. Introduction**

Nowadays, the hybrid energy system has become more popular in the electricity industry. The main reasons for this trend are the exponential reduction of energy storage costs and the development of digital connectivity, enabling real-time monitoring and the establishment of smarter grids. Moreover, the hybrid energy system is considered one of the best solutions for tackling the intermittency experienced by most renewable energy schemes, including solar- and wind-based energy. For example, in solar photovoltaics, energy is only delivered when sufficient solar irradiance is obtained. As a consequence, a lot of research has been conducted in order to provide the best scheme for this hybrid system [1].

Among the proposed hybrid energy system schemes, the one most likely to be implemented in the majority of countries is the integration of renewable and fossil energy, because fossil energy infrastructure is already well established. In terms of sustainability, this scheme works well, because fossil energy is clearly able to supply adequate power to the grid when the alternative sources cannot handle users' load demand. The major concern with this scheme is the cost borne by users once the renewable sources cannot supply adequate power to the grid, in which case they will pay an expensive fossil-based electricity price. Moreover, fossil energy prices tend to gradually increase, leading to economic conflict in society [2]. Therefore, in order to tackle this issue, an appropriate demand response scheme can be applied.

Demand response is the change in electricity usage by users in response to a change in the electricity price or to an incentive rewarding lower power consumption [3]. Applying demand response to this hybrid system is very beneficial for shaving peak load demand [4,5], leading to a reduction in fossil energy consumption. Moreover, it can provide short-term impact and economic benefits for both consumers and utilities.

In order to support this demand response, short-term load forecasting (STLF) is very important for predicting whether the energy storage from renewable sources is able to handle the forthcoming power consumption or not. If the prediction states that the storage is not adequate to support the future load, then the electricity utility can announce this situation to the users, which eventually triggers them to reduce their electricity usage, because users not only want to avoid paying more for the conventional energy source but also want to receive incentives from the authorities.

Fortunately, with the help of developed infrastructures like smart meters equipped with many sensors and the Internet of Things (IoT), a robust STLF method is feasible to implement. Broadly speaking, research in load forecasting can be categorized into two classes: traditional and advanced models. Traditional models use simple statistical methods, for example regression models [6] and Kalman filtering models [7]. Among the proposed traditional models, the autoregressive integrated moving average (ARIMA) and generalized autoregressive conditional heteroskedasticity (GARCH) are two of the most popular regression techniques used in several preceding research studies [8,9]. Unfortunately, these traditional models only provide good accuracy if the electrical load and the other parameters have a linear relationship. Meanwhile, the advanced models are data-driven models implementing machine learning techniques, for instance the support vector machine (SVM) [10], K-nearest neighbors (KNN) [11], and others [12–14].

However, based on recent publications [15–18], deep learning-based methods show the most convincing performance, outperforming other machine learning-based solutions. The main reasons for this superiority are, first, that deep learning does not rely heavily on feature engineering and its hyperparameter tuning is relatively easy compared to other data-driven models. The second is the availability of huge datasets, from which deep learning can precisely map inputs to outputs by building complex relations among the layers in the network. Moreover, with the availability of Graphics Processing Unit (GPU) parallel computation and weight-sharing methods like the convolutional neural network (CNN) [19], the computational speed of deep learning models has become significantly faster.

Because of the superiority of deep learning, this research proposes a method for the load forecasting task, specifically STLF, to predict hourly power consumption using deep learning algorithms, namely the combination of an advanced version of the convolutional neural network (CNN), the dilated causal residual CNN, and long short-term memory (LSTM) [20]. The dilated causal residual CNN is inspired by the Wavenet model [21], which is very famous for audio generation, and the residual network [22] with a gated activation function. This model learns the trend based on the long sequence input, while the LSTM layer works as the model's output self-correction, relating the output of the wavenet-based model to the recent load demand trend (short sequence).

The main contribution of this research is a novel model that combines a dilated causal residual CNN and LSTM, utilizes long and short sequential data, and applies a fine-tuning technique. External feature extraction or feature selection is not included in this research. Moreover, this research only takes time index information into account as external factor data, making the model easy to use as a benchmark for future research.

In order to prove the generality of our proposed model's performance, two different model testing scenarios are conducted. The first scenario uses a testing dataset having an identical distribution to the validation dataset, while the second uses a dataset having an unknown distribution. As a comparison, our proposed model's results are compared with the performance of the models from [15,16] and the standard wavenet [21]. The simulation results show that our proposed model outperforms the other deep learning models in terms of root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE).

Therefore, given its accuracy, this model can be used to support utilities in applying a demand response program, since it helps the utilities obtain an accurate prediction of future load demand, eventually providing precise information to the users on whether the future load demand can be supplied by the renewable source or not.

The rest of the paper is organized as follows. Section 2 describes the dataset used in detail, including preliminary data analysis and data preprocessing. Section 3 explains the model architecture and its parameters, as well as the training and testing stages. Section 4 presents the results obtained using the proposed model compared with other deep learning-based models and clearly explains why the proposed model achieves these results. Lastly, Section 5 summarizes the findings discussed in this paper and possible future works.

#### **2. Dataset**

In order to prove the generality of our proposed method, two datasets are used as the model's input: the ENTSO-E (European Network of Transmission System Operators for Electricity) dataset [23] and the ISO-NE (Independent System Operator New England) dataset [24]. The ENTSO-E dataset contains load demand data from every country in Europe; in this research we only take into account the data gathered from Switzerland. Meanwhile, the ISO-NE dataset contains the hourly load demand in New England.

Those datasets have different characteristics, especially in terms of load demand range and complexity. In the ENTSO-E dataset, the lowest and highest load values are 1483 kWh and 18,544 kWh, respectively, while in the ISO-NE dataset they are 9018 kWh and 26,416 kWh, respectively. Another differing characteristic is the fluctuation trend within a single day's load demand: in the ENTSO-E dataset, the load demand oscillates more than in the ISO-NE dataset. As proof, Figures 1 and 2 show the average daily power consumption and an example of the one-day load trend for each dataset.

**Figure 1.** Average daily power consumption of each dataset.

**Figure 2.** Example of one-day load demand of each dataset (**a**) ISO-NE dataset; (**b**) ENTSO-E dataset.

In building a solution for load forecasting, a deep understanding of the load demand trend is necessary. Figure 3 shows the data in both datasets over a one-year period. Broadly speaking, the load demand trend has a periodicity that repeats over the following weeks. However, this trend is highly affected by many external factors causing fluctuations over certain periods, for example the economy, weather, and time index. Unfortunately, obtaining most of those external factors is difficult; the easiest external data to gather is the time index, containing information about the date and clock. Therefore, in this research we fed the model not only the load demand trend but also time index data represented by one-hot vectors. A one-hot vector is a sparse vector that maps a feature with M categories into a vector of M elements, where only the corresponding element is one while the remaining elements are zero.
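The one-hot mapping described above can be sketched as follows; the `one_hot` helper and the hour-of-day example are illustrative, not part of the original pipeline:

```python
import numpy as np

def one_hot(index, num_categories):
    # Sparse vector with M elements: only the matching element is one
    vec = np.zeros(num_categories)
    vec[index] = 1.0
    return vec

# Example: encode hour-of-day 13 among 24 hourly categories
hour_vec = one_hot(13, 24)
```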

**Figure 3.** Load demand trend over a year.

Before being fed to the model, the datasets must be pre-processed, which consists of checking for null values, splitting the dataset into training, validation, and testing sets, and finally data normalization. The data used in this research are limited to the period from 1 January 2015 until 30 May 2017 for the ENTSO-E dataset and from 1 January 2004 until 30 May 2006 for the ISO-NE dataset. The first two years of data are used for the training stage, while the remaining 3600 data points are split into 3000 and 600. The first 3000 points are randomly divided between validation and testing with proportions of roughly 0.65 and 0.35, respectively. In other words, the validation data and the first testing data comprise 1900 and 1100 points, respectively, while the other 600 points are used as the second testing data. We conduct two testing stages, because the first testing data have a nearly identical distribution to the validation data, which clearly helps the model provide good accuracy in the testing stage. Meanwhile, the second testing data are clearly new data whose distribution is never experienced by the model in either the training or validation stage. This kind of testing process is appropriate for proving the generality of the model.

In the data normalization process, min-max scaling method as expressed in Equation (1) is implemented.

$$z\_i = \frac{x\_i - \min(x)}{\max(x) - \min(x)}\tag{1}$$

*x* = {*x*1, ... , *xn*} is the original data and *zi* is the *i*th normalized data point. The parameters in the normalization process must come from the training dataset only, because we assume that the future data (validation and test sets) may have a different distribution from the training dataset.
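A minimal sketch of this train-only min-max scaling (Equation (1)); the load values below are made up for illustration:

```python
import numpy as np

def fit_minmax(train):
    # Scaling parameters come from the training split only
    return float(train.min()), float(train.max())

def minmax_scale(x, lo, hi):
    # Equation (1): z_i = (x_i - min(x)) / (max(x) - min(x))
    return (x - lo) / (hi - lo)

train = np.array([1500.0, 9000.0, 18000.0])
lo, hi = fit_minmax(train)
z_train = minmax_scale(train, lo, hi)    # always within [0, 1]
z_test = minmax_scale(20000.0, lo, hi)   # unseen data may fall outside [0, 1]
```

Note that because the scaler is fitted on the training split only, values from a shifted test distribution can legitimately land outside [0, 1].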

#### **3. Model Design**

In this research, the cores of the proposed model are the wavenet architecture, implementing both the dilated causal CNN and the residual network, and the LSTM, which have proven to work very well in time-series data prediction.

Our proposed model consists of two stages whose functions are to learn long and short sequence data. Inspired by the success of the wavenet architecture and LSTM in handling time series data, the long sequence, taken from the 32 time steps before the target, is learned using the wavenet, while the short sequence, taken from the 4 time steps before the target, is learned using the LSTM.

#### *3.1. Wavenet*

Wavenet is a deep generative model utilizing a dilated causal convolutional neural network on audio waveforms. Causal convolution means that the output at the current time step is only affected by previous time steps. Meanwhile, the dilated convolutional neural network is a modified convolutional neural network in which the filter weight alignment is expanded by a dilation factor, eventually resulting in a broader receptive field, which can be expressed as follows:

$$(F \ast\_l k)(p) = \sum\_{s+lt=p} F(s)k(t) \tag{2}$$

while the standard convolution is expressed as follows:

$$(F\*k)(p) = \sum\_{s+t=p} F(s)k(t) \tag{3}$$

The dilated convolution is denoted by the ∗*l* notation. The difference between dilated convolution and standard convolution is the notation *l*, representing the dilation factor, which causes the filter to skip one or several points during the convolution process.

Figure 4 shows the dilated convolution applied to one-dimensional data. The blue, white, and orange circles are the input data, hidden layer outputs, and output layer outputs, respectively. There are 32 input data points, taken from *t* = 1 until *t* = 32, that are convolved with a filter of size two. The dilation rate is increased in every hidden layer, which produces a broader receptive field, and this dilation pattern is repeated twice. In the output layer, only the last value is taken, which we assume represents the feature of the load at *t* = 33.
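The receptive field of such a stack can be computed with the standard formula below; the kernel size and dilation schedule are taken from Section 3.3, and the helper name is ours:

```python
def receptive_field(kernel_size, dilation_rates):
    # Each dilated causal layer widens the field by (kernel_size - 1) * dilation
    return 1 + sum((kernel_size - 1) * d for d in dilation_rates)

# One pass of the dilation schedule with filter size 2 already spans 32 time steps,
# matching the length of the long input sequence
rf = receptive_field(2, [1, 2, 4, 8, 16])
```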

In order to optimize the usage of the dilated convolutional neural network, the residual technique [22] is applied to the model. The implementation of the residual network takes into account lower-level outputs, which contain features that help in predicting the future power demand, especially in the case of a network implementing a sparse filter, which has the potential to lose information from the previous layers.

**Figure 4.** Dilated causal convolutional neural network with filter size 2.

The residual model is a famous way to build deep neural networks that was first proposed for the image recognition task. With this approach, instead of only mapping the input data *x* to a function *H*(*x*) that outputs *y*ˆ, the mapping from the previous residual block *f*(*x*, {*Wi*}), where *Wi* denotes the learned weights and biases of residual block *i*, is also considered. Therefore, the output of the residual block can be expressed as:

$$H(\mathbf{x}) = f(\mathbf{x}, \{\mathbf{W}\_i\}) + \mathbf{x} \tag{4}$$

Moreover, since we use the stacked residual block, then the output of the residual block can be represented as:

$$\mathbf{x}\_K = \mathbf{x}\_0 + \sum\_{i=1}^K f(\mathbf{x}\_{i-1}, \mathcal{W}\_{i-1}) \tag{5}$$

*xK* is the output of residual block *K*, *x*<sup>0</sup> is the input of the residual network, and *f*(*xi*−1, *Wi*−1) is the output of each previous residual block given its input and associated weights. As a result of the summations between the input and the residual blocks, the backpropagation of the network to *x*<sup>0</sup> can be calculated using the following equation:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}\_0} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}\_K} \frac{\partial \mathbf{x}\_K}{\partial \mathbf{x}\_0} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}\_K} \left( 1 + \frac{\partial}{\partial \mathbf{x}\_0} \sum\_{i=1}^K f(\mathbf{x}\_{i-1}, \mathcal{W}\_{i-1}) \right) \tag{6}$$

L is the total loss of the network, and the constant 1 indicates that the gradient of the network output can be directly back-propagated without considering the layers' parameters (weights and biases). This formulation ensures that the layers do not suffer from vanishing gradients, even when the weights are small. Figure 5 shows the basic residual learning process.
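The stacked-residual identity in Equation (5) can be checked numerically with a small sketch; the residual functions here are arbitrary stand-ins for the learned blocks:

```python
import numpy as np

def stacked_residual(x0, blocks):
    # Apply K residual blocks: x_k = x_{k-1} + f_k(x_{k-1})
    x = x0
    residuals = []
    for f in blocks:
        r = f(x)
        residuals.append(r)
        x = x + r
    return x, residuals

x0 = np.array([1.0, 2.0])
blocks = [lambda x: 0.5 * x, lambda x: x - 1.0]
xK, residuals = stacked_residual(x0, blocks)
# Equation (5): the final output equals the input plus the sum of all residuals
assert np.allclose(xK, x0 + sum(residuals))
```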

Moreover, skip connections and gated activations are applied to the network to speed up convergence and avoid overfitting. The process of the residual and skip connections with gated activation is shown in Figure 6.

**Figure 5.** Residual learning process.

**Figure 6.** Overview of residual dilated convolutional block and gated activation function.

The gated activations are inspired by the LSTM layer, where tanh and sigmoid (σ) work as a learned filter and a learned gate, respectively. The use of gated activations has been proven to work better than the ReLU activation on time series data [21]. The output of the dilated convolution with gated activations can be expressed as:

$$z = \tanh(w\_f \ast \mathbf{x}) \odot \sigma(w\_g \ast \mathbf{x}) \tag{7}$$

where *wf* and *wg* are the learned filter and learned gate, respectively.
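A numpy sketch of the gated activation in Equation (7), with the convolutions already applied; the two input arrays stand in for *wf* ∗ **x** and *wg* ∗ **x**:

```python
import numpy as np

def gated_activation(filter_branch, gate_branch):
    # Equation (7): tanh acts as the learned filter, sigmoid as the learned gate
    sigmoid = 1.0 / (1.0 + np.exp(-gate_branch))
    return np.tanh(filter_branch) * sigmoid

z = gated_activation(np.array([0.5, -1.0]), np.array([2.0, -2.0]))
```

Because the gate is bounded in (0, 1) and the filter in (−1, 1), the output is always bounded in (−1, 1) as well.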

#### *3.2. LSTM*

In the case of forecasting future data, knowledge of the recent trend is very essential. As an illustration, when predicting future data, we usually start by figuring out the long sequence trend. Once we know the pattern of the trend based on the long sequence of previous data, then in order to provide a better forecast, we also try to relate our understanding of the long sequence data to the recent trend. The same concept is applied in our proposed method. We fine-tune the wavenet-based model with one LSTM layer assigned to help the network relate the output of the dilated CNN to the recent trend. This step can also be considered a correction step for the dilated CNN output, as we insert a fixed input to be concatenated with the dilated CNN output, which is then fed to the LSTM layer that also works as the output layer.

#### Brief Explanation of LSTM

LSTM is a developed version of the standard recurrent neural network (RNN): instead of only having a recurrent unit, LSTM has "LSTM cells" with an internal recurrence consisting of several gating units controlling the flow of information [25]. A comparison between a simple RNN and an LSTM layer using tanh as the activation function is depicted in Figure 7.

**Figure 7.** Comparison between simple recurrent neural network (RNN) and long short-term memory (LSTM) layer: (**a**) simple RNN layer, (**b**) LSTM layer.

From Figure 7, the difference between the simple RNN and the LSTM layer is clear. The LSTM layer is more complex than the simple RNN, because LSTM takes into account not only the input (*x*) and hidden state (*h*) at a certain time step, but also the LSTM cells (*c*) that replace the hidden state to prevent older signals from vanishing during the process. The three control gates ruling the LSTM cells, the forget gate, input gate, and output gate, are represented by *ft*, *it*, and *ot*, respectively. Those gates use the sigmoid activation function σ, whose output ranges between 0 and 1. Meanwhile, *c̃t* is the input node, which works identically to the simple RNN layer.

Mathematically, the input gate and forget gate, respectively, can be expressed as:

$$i\_t = \sigma(\mathcal{W}\_i \times [h\_{t-1}, \mathbf{x}\_t] + b\_i) \tag{8}$$

$$f\_t = \sigma(\mathcal{W}\_f \times [h\_{t-1}, \mathbf{x}\_t] + b\_f) \tag{9}$$

[*ht*−1, *xt*] is the concatenation of the hidden state and input values, while *W* and *b* are the corresponding weight matrices and biases, respectively. The cell state is then updated with the formulation:

$$c\_t = f\_t \times c\_{t-1} + i\_t \times \tilde{c}\_t \tag{10}$$

where *c̃t* is expressed as:

$$\tilde{c}\_t = \tanh(\mathcal{W}\_c \times [h\_{t-1}, \mathbf{x}\_t] + b\_c) \tag{11}$$

Lastly, the output gate *ot* and the hidden state *ht* are calculated using the following equations:

$$o\_t = \sigma(\mathcal{W}\_o \times [h\_{t-1}, \mathbf{x}\_t] + b\_o) \tag{12}$$

$$h\_t = o\_t \times \tanh(c\_t) \tag{13}$$
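Equations (8)–(13) can be collected into a single numpy step; the weight shapes and random initialization below are illustrative only, not the paper's trained parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W and b hold one (weight, bias) pair per gate, acting on [h_{t-1}, x_t]
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W["i"] @ z + b["i"])        # input gate, Eq. (8)
    f_t = sigmoid(W["f"] @ z + b["f"])        # forget gate, Eq. (9)
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # input node, Eq. (11)
    c_t = f_t * c_prev + i_t * c_tilde        # cell state update, Eq. (10)
    o_t = sigmoid(W["o"] @ z + b["o"])        # output gate, Eq. (12)
    h_t = o_t * np.tanh(c_t)                  # hidden state, Eq. (13)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {k: rng.normal(size=(n_hid, n_hid + n_in)) for k in "ifco"}
b = {k: np.zeros(n_hid) for k in "ifco"}
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
```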

#### *3.3. Detailed Model Setup*

A complete representation of our model is depicted in Figure 8. The wavenet-based model is fed 32 data sequences containing load demand and time index information (clock, day, and month). This wavenet model is used as the initial forecasting algorithm based on long sequence data. Before being fed to the dilated CNN, the long sequence data are preprocessed using a standard 1D-CNN with a filter size equal to one. Next, the preprocessed data are convolved by the dilated causal CNN with a filter size of 2. All of the convolutional layers have the ReLU activation function, expressed as:

$$f(\mathbf{x}) = \max(0, \mathbf{x}) \tag{14}$$

**Figure 8.** Proposed model architecture.

In the case of the residual network, the total number of residual blocks in our model is 10, with the dilation rates set as a repetition of the sequence *dr* = [1, 2, 4, 8, 16]. The dilation rate is a hyperparameter that represents how large the gap between the elements of the filter is, as indicated by the arrows in Figure 4. The outputs of all residual blocks are summed and followed by the ReLU activation function. The post-processing is conducted before the last convolutional layer, which works similarly to a time-distributed fully connected layer assigned to normalize the residual output.

Because the lengths of the input and output of the dilated causal CNN are identical, a customized layer is built to take only the last output neuron, representing the load at *t* = 33. After completing the wavenet-based network training, the output of the last convolutional layer is concatenated with the recent data sequence (*t* = 29 until *t* = 32), which works as the LSTM input. Here, using the fine-tuning technique, the LSTM with a linear activation function is assigned to self-correct the convolutional output in order to provide a more accurate prediction based on the short sequence. This self-correction method is nearly identical to how humans make predictions based on data sequences. For example, in predicting the environmental temperature, humans relate their understanding of the previous day's temperature trend to the recent temperature trend in order to make an accurate prediction of the next hour's temperature.

Table 1 shows the summary of our model's parameters, where *f*, *ks*, *s*, and *dr* are the number of filters, kernel size, stride, and dilation rate, respectively. The loss function and optimizer are the mean absolute error and adaptive moment estimation (ADAM) [26], respectively, and the batch size is 512. The model was trained using an Nvidia GTX 1070, Tensorflow 1.13.1 [27], CUDA 10, and CuDNN 7.6.2 with 500 epochs, and the final model is chosen based on the validation accuracy.

**Table 1.** Summary of the proposed model's parameters.


#### **4. Result and Discussion**

#### *4.1. Model Performance Evaluation*

In order to evaluate our model's performance, we mainly compared it with two previous deep learning-based models that work very well in the case of STLF. Those models are inspired by [15] (Model 1), which utilizes stacked LSTM, and [16] (Model 2), which combines stacked CNN and LSTM layers with a feature fusion layer. The configuration of each comparison model is identical to the published papers. Model 1 consists of two stacked LSTM layers of 20 units each, followed by a fully connected layer as an output layer. Model 2 is built from a combination of LSTM and CNN layers whose outputs are concatenated and eventually fed to a fully connected layer called a feature fusion layer. All of the models are trained with the same input data. Moreover, the original wavenet model is also used as a comparison in order to prove the benefit of the LSTM as a self-correction layer in our proposed model.

This section reports the performance of hourly load forecasting using our proposed method compared to the other forecasting methods. In the testing stage, all of the models are evaluated with three different commonly used metrics: root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). MAE is the average of the absolute differences between the predicted load and the actual load consumption. MAPE is similar to MAE but uses the ratio between the difference and the actual load, while RMSE is another metric that tends to have a higher value than the other metrics. The higher the value of a metric, the worse the performance of the model. These metrics are defined as follows:

$$\text{RMSE} = \sqrt{\frac{1}{N} \sum\_{i=1}^{N} \left(\hat{y}\_i - y\_i\right)^2} \tag{15}$$

$$\text{MAE} = \frac{1}{N} \sum\_{i=1}^{N} \left| \hat{y}\_i - y\_i \right| \tag{16}$$

$$\text{MAPE} = \frac{1}{N} \sum\_{i=1}^{N} \left| \frac{\hat{y}\_i - y\_i}{y\_i} \right| \tag{17}$$
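The three metrics translate directly into numpy; in this sketch we scale MAPE by 100 since the reported values are percentages, although Equation (17) omits the factor:

```python
import numpy as np

def rmse(y_hat, y):
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))        # Eq. (15)

def mae(y_hat, y):
    return float(np.mean(np.abs(y_hat - y)))                # Eq. (16)

def mape(y_hat, y):
    return float(100.0 * np.mean(np.abs((y_hat - y) / y)))  # Eq. (17), as a percentage

# Tiny worked example: errors of +10 and -10 on loads of 100 and 200
y = np.array([100.0, 200.0])
y_hat = np.array([110.0, 190.0])
```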

#### *4.2. ENTSO-E Load Prediction*

Tables 2 and 3 show all of the models' performance on dataset 1 and dataset 2, respectively. Overall, our proposed model outperforms the other models, especially on dataset 1, where the data distribution is nearly identical to the validation data, while the worst performance is shown by the [16]-based model. In Table 2, our proposed model clearly shows its excellence over the other methods, demonstrating the success of combining the deep residual causal CNN and the LSTM through the fine-tuning technique in understanding long and short sequence data, respectively. Moreover, compared to the standard wavenet model, our model achieves better accuracy, indicating the usefulness of the LSTM layer in making self-corrections to the initial load forecast output by the dilated causal residual CNN.

**Table 2.** Forecasting result obtained using European Network of Transmission System Operators for Electricity (ENTSO-E) testing dataset 1.



**Table 3.** Forecasting result obtained using the European Network of Transmission System Operators for Electricity (ENTSO-E) testing dataset 2.

However, all of the models exhibit degraded performance on dataset 2. This indicates that the models still cannot understand data whose distribution differs slightly from the training, validation, and first testing data. The failure of all of the models in testing on dataset 2 is highly affected by the quality of the ENTSO-E dataset, in which inconsistencies in hourly power usage or unpredicted trends occur several times. Figures 9 and 10 show the results of STLF on dataset 1 and dataset 2, respectively.

**Figure 9.** Forecasting result using ENTSO-E dataset 1.

**Figure 10.** Forecasting result using ENTSO-E dataset 2.

#### *4.3. ISO-NE Load Prediction*

Tables 4 and 5 show all of the models' performance on ISO-NE dataset 1 and dataset 2, respectively. In contrast to the performance shown in the previous subsection, all of the models perform very well on both the known and unknown data distributions. However, the model based on [16] still exhibits the worst accuracy, while our model still performs the best, although its advantage over the model based on [15] is not absolute. As with the results obtained using the ENTSO-E dataset, the implementation of the LSTM layer for tuning the wavenet-based model is proven to help the network make more accurate load predictions.

**Table 4.** Forecasting result obtained using Independent System Operator New England (ISO-NE) testing dataset 1.


**Table 5.** Forecasting result obtained using Independent System Operator New England (ISO-NE) testing dataset 2.


The success of all the models in understanding an unknown data distribution on this dataset is mainly due to the properties of the ISO-NE data, which are simpler than those of the ENTSO-E dataset, in which more fluctuations occur over certain ranges of time due to external factors. This simplicity results in high accuracy on both the validation and testing data, enabling all models to handle forthcoming data sequences. Figures 11 and 12 show the forecasting results using input from dataset 1 and dataset 2, respectively.

**Figure 11.** Forecasting result using ISO-NE dataset 1.

**Figure 12.** Forecasting result using ISO-NE dataset 2.

#### *4.4. Discussion*

Overall, our proposed model shows the best performance compared to the other deep learning-based models. It indicates that each network in the proposed model works very well. The wavenet-based network provides understanding of long sequence data and the LSTM layer helps the model do self-correction by relating the output of the first network with the recent or short sequence data.

However, as suggested by [12,28,29], in order to show our proposed model's significance over the other models, both the Wilcoxon signed-rank test [30] and the Friedman test [31] are conducted on all the models' forecasting errors given input from the testing datasets. The Wilcoxon signed-rank test compares the statistic *Wstatistic* with the Wilcoxon critical value *W*, which are expressed in Equations (18) and (19) (for large amounts of data), respectively.

$$W\_{\text{statistic}} = \min \{ r^+, r^- \} \tag{18}$$

$$\mathcal{W} = \frac{N(N+1)}{4} \tag{19}$$

*r*<sup>+</sup> and *r*<sup>−</sup> are the sums of the positive and negative ranks, respectively, while *N* is the number of data points. If *Wstatistic* is less than *W* and the *p*-value is less than α, then the null hypothesis is rejected, indicating the superiority of our model.
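A simplified computation of the statistic in Equation (18); the error values are hypothetical, and ties among the absolute differences are not handled in this sketch:

```python
import numpy as np

def wilcoxon_statistic(err_a, err_b):
    # Paired differences; zero differences are dropped before ranking
    d = err_a - err_b
    d = d[d != 0]
    ranks = np.empty(len(d))
    ranks[np.abs(d).argsort()] = np.arange(1, len(d) + 1)
    r_plus = ranks[d > 0].sum()    # sum of positive ranks
    r_minus = ranks[d < 0].sum()   # sum of negative ranks
    return min(r_plus, r_minus)   # Equation (18)

# Hypothetical absolute forecast errors from two models on the same test points
err_a = np.array([1.3, 0.9, 1.2, 0.6])
err_b = np.array([1.0, 1.0, 1.0, 1.0])
w = wilcoxon_statistic(err_a, err_b)
```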

On the other hand, the Friedman test is applied to show the significant difference between our proposed model and all the comparison models. The statistic *F* is expressed by:

$$F = \frac{12N}{k(k+1)} \left[ \sum\_{j=1}^{k} R\_j^2 - \frac{k(k+1)^2}{4} \right] \tag{20}$$

where *N* is the number of forecasting results, *k* is the number of compared models, and *Rj* is the average rank based on the forecasting errors *r* for the *j*th compared model, expressed by:

$$R\_j = \frac{1}{N} \sum\_{1}^{N} r\_i^j \tag{21}$$

If the *p*-value of *F* is less than the critical value, then the null hypothesis is rejected, indicating the superiority of our model.
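The Friedman statistic of Equations (20) and (21) can be sketched as follows; the error matrix is hypothetical, and ties in the per-point ranking are ignored:

```python
import numpy as np

def friedman_statistic(errors):
    # errors: (N, k) forecasting errors for N test points and k models;
    # on each test point, the model with the smallest error gets rank 1
    N, k = errors.shape
    ranks = errors.argsort(axis=1).argsort(axis=1) + 1
    R = ranks.mean(axis=0)                                   # Eq. (21)
    return 12 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4)  # Eq. (20)

# One model consistently better than the other over N = 4 test points
errors = np.array([[0.1, 0.5], [0.2, 0.9], [0.3, 0.7], [0.1, 0.4]])
F = friedman_statistic(errors)
```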

Tables 6–9 show the significance tests on ENTSO-E testing datasets 1 and 2 and ISO-NE testing datasets 1 and 2, respectively. In the Wilcoxon signed-rank test, the significance levels are set to α = 0.025 and α = 0.05, while the Friedman test is conducted only under α = 0.05. From the results shown in the tables, our proposed model makes a significant contribution to forecasting accuracy improvement and shows superiority over the other models, except on the ISO-NE dataset, where the Wilcoxon signed-rank test between our proposed model and the Kong et al. model indicates no significance, although the results in terms of RMSE, MAE, and MAPE show that our model performs better.


**Table 6.** Result of the Wilcoxon signed-rank test and Friedman test using ENTSO-E testing dataset 1.

**Table 7.** Result of the Wilcoxon signed-rank test and Friedman test using ENTSO-E testing dataset 2.

**Table 8.** Result of the Wilcoxon signed-rank test and Friedman test using ISO-NE testing dataset 1.

**Table 9.** Result of the Wilcoxon signed-rank test and Friedman test using ISO-NE testing dataset 2.

#### **5. Conclusions and Future Works**

This paper proposes a novel method for hourly load forecasting, which is crucial for demand response in hybrid energy systems, especially systems using both renewable and fossil energy, in order to reduce fossil energy usage. The proposed method is mainly inspired by the wavenet-based model utilizing dilated causal residual CNN and LSTM. In this approach, two different data sequences are fed to the model: the long data sequences are fed to the wavenet-based model, while the short data sequences are fed to the LSTM layer, which is assigned for model self-correction using a fine-tuning technique.

In order to prove the generality of our model, two different datasets, the ENTSO-E and ISO-NE datasets, are used with two different testing scenarios. The first testing scheme uses a dataset having a nearly identical distribution to the validation dataset, while the second uses a dataset with a slightly different data distribution. Based on the obtained results, our proposed model exhibits the best performance compared to other deep learning-based models in terms of RMSE, MAE, and MAPE. In detail, our model achieves RMSE, MAE, and MAPE equal to 203.23, 142.23, and 2.02 for ENTSO-E testing dataset 1 and 292.07, 196.95, and 3.1 for ENTSO-E dataset 2. Meanwhile, in the ISO-NE dataset, the RMSE, MAE, and MAPE equal 85.12, 58.96, and 0.4 for ISO-NE testing dataset 1 and 85.31, 62.23, and 0.46 for ISO-NE dataset 2. However, several findings can be improved in future work. The first concerns the ENTSO-E testing results: no model can provide highly accurate forecasts when fed data from a slightly different distribution, indicating that all models face difficulties with fluctuating or unpredictable data such as the ENTSO-E dataset. The second is that, although RMSE, MAE, and MAPE show our model exhibits better accuracy than the Kong et al. model on the ISO-NE dataset, the improvement is not statistically significant.

Therefore, for future work, additional external data such as holiday information and weather conditions can be fed as model inputs in order to improve these findings. Building a new model is also a direction worth pursuing, since this research area and artificial intelligence (AI) algorithms develop very quickly and new ideas emerge rapidly. All of the code and datasets used in this research are available on GitHub.

**Author Contributions:** Conceptualization, S.H.P.; methodology, S.H.P. and M.R.; software, M.R. and E.M.; validation, S.H.P. and M.R.; formal analysis, S.H.P., M.R., E.M. and R.N.H.; investigation, S.H.P., M.R. and E.M.; resources, S.H.P.; data curation, M.R. and F.H.; writing—original draft preparation, M.R. and E.M.; writing—review and editing, S.H.P. and R.N.H.; visualization, M.R., R.N.H. and F.H.; supervision, S.H.P.; project administration, F.H.; funding acquisition, S.H.P.

**Funding:** This research was funded by Indonesia Ministry of Research and Technology (KEMENRISTEKDIKTI) (Grant No.332.51/UN10.C19/PN/2019).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Classification of Special Days in Short-Term Load Forecasting: The Spanish Case Study**

#### **Miguel López \*, Carlos Sans, Sergio Valero and Carolina Senabre**

Electrical Engineering Area, University Miguel Hernández, Av. de la Universidad, s/n, 03202 Elche, Spain; carsantr@gmail.com (C.S.); svalero@umh.es (S.V.); csenabre@umh.es (C.S.)

**\*** Correspondence: m.lopezg@umh.es; Tel.: +34-965-222-407

Received: 4 February 2019; Accepted: 27 March 2019; Published: 1 April 2019

**Abstract:** Short-Term Load Forecasting is a very relevant aspect of managing, operating or participating in an electric system. From system operators to energy producers and retailers, knowing the electric demand in advance with high accuracy is a key feature of their business. The load series of a given system presents highly repetitive daily, weekly and yearly patterns. However, other factors like temperature or social events cause abnormalities in this otherwise periodic behavior. In order to develop an effective load forecasting system, it is necessary to understand and model these abnormalities because, in many cases, the higher forecasting error typical of these special days accounts for the larger part of the losses related to load forecasting. This paper focuses on the effect that several types of special days have on the load curve and how important it is to model these behaviors in detail. The paper analyzes the Spanish national system and uses linear regression to model the effect that social events like holidays or festive periods have on the load curve. The results presented in this paper show that a large classification of events is needed in order to accurately model all the events that may occur in a 7-year period.

**Keywords:** load forecasting; special days; regressive models

#### **1. Introduction**

Short-term load forecasting (STLF) is a determining factor in the operation of an electric system. It is a necessary process to ensure the balance between generation and demand. The system operator needs to know the expected load to make decisions and to perform an optimal control of the electrical system. Many countries have liberalised electrical markets, which promotes the participation of multiple agents. This participation yields a competitive system, which leads to reduced costs for the final consumer. Accurate load forecasting leads to an optimization of power generation and system operation and a consequent reduction of costs. In addition, a good STLF leads to a better share of renewable energy in the electric system, therefore reducing the CO2 emitted to the atmosphere in accordance with the Paris Agreement [1], which has been ratified by several countries. This reduction helps to avoid the emission of excess CO2 for countries like those of the EU, Iceland, Liechtenstein and Norway, which must comply with the EU Emissions Trading System [2]. The optimization of electric generation reduces its cost, improving competitiveness among companies and, subsequently, the economic and industrial development of a country.

Several works have been published about STLF models in the last decades [3]. These methods split into three large groups: artificial intelligence [4–19], statistical [20,21] and hybrid models [22–24]. Regarding artificial intelligence, techniques such as artificial neural networks (ANN) [4,5], extreme learning machine (ELM) [6,7], support vector machine (SVM) [8,9,11], adaptive neuro-fuzzy inference system (ANFIS) [10–14], fuzzy logic (FL) [14–16], genetic algorithm (GA) [16] and self-organizing map (SOM) [17–19] have been successfully applied to STLF. On the other hand, statistical models such as the autoregressive (AR) model [20,21], autoregressive integrated moving average (ARIMA) [21] and exponential smoothing (ES) [25,26] have been extensively used in STLF. Other forecasting methods are considered hybrid models [23,24], which combine various techniques (statistical models and/or artificial intelligence) to obtain the load forecast.

One of the most used AI methods for STLF is Neural Networks (NNs), and the amount of research on this topic found in reference databases like SCOPUS is much higher than for the rest of artificial intelligence techniques. This technique has been used over the last decades with successful results. In the early days, a review [4] of works published between 1991 and 1999 reported the use of different NN models for STLF, describing the doubts raised by some authors about adopting this technique. That study concluded that many research groups used small data sets with only one NN for STLF, while other groups over-parameterized their models, increasing the complexity of the forecasting model and decreasing its accuracy. In conclusion, further investigation was needed to build an NN prediction model that would clear up the controversy in the scientific community.

In recent years, different techniques based on NNs have provided good load forecasting results thanks to the increase in the amount of historical data used. In [5], different NN methodologies are compared: multilayer perceptron (MLP) [27], radial basis function neural networks (RBFNN) [28], generalized regression neural network (GRNN) [29] and counter-propagation neural networks (CPNN) [30], all of which learn from patterns that represent the daily load curves. The results showed that the STLF of a GRNN model is more accurate than that of the other NN models analysed in this work.

The ELM technique shown in [6] is a different AI forecasting model. In [7], it is tested for STLF against other NN models: the ELM model provides more efficient training and more accurate predictions.

Another AI-based technique is SVM, which is described in [8]. This method is used in [9] to forecast load by combining four SVM models according to certain values of temperature and demand. In [11], a comparison between SVM and ANFIS [12,13] is presented. The comparison uses essential information about the days of the week, with the SVM model achieving results much closer to the actual load.

The performance of the FL model was also compared to that of the ANFIS model using the same parameters and data [14]. The predictions obtained in both cases were successful, although the ANFIS model was more accurate than the FL model. The latter technique was used to obtain the STLF by using different parameters in [15] (e.g., weather, time, historical data), demonstrating that this method can provide a more accurate prediction than conventional models.

In most cases, the genetic algorithm AI technique was used to select the most important parameters for forecasting electricity demand. An optimization algorithm such as simulated annealing is combined with another technique such as FL to obtain the optimal parameters [16] by means of the back-propagation method. This approach improves the accuracy of the predictions.

Kohonen's Self-Organizing Map (SOM) [17] was also used to successfully obtain forecasting loads [18]. Moreover, the SOM model was a very useful tool for classifying the data of the parameters used to forecast the load. The classification of the meteorological data [19] by using the SOM model to cluster the data provided a prediction of demand through nonlinear autoregressive network with exogenous inputs (NARX) [31].

The autoregressive models are the most commonly used statistical models, and many research articles employ them for STLF. In [20], Baharudin et al. analysed the autoregressive (AR) model and the autoregressive moving-average (ARMA) model [32] for load forecasting, concluding that the AR model performed more accurately.

A comparison of different statistical models was studied by Taylor et al. [21]. It compared methods such as autoregressive (AR) models, autoregressive integrated moving average (ARIMA) [33], a regression method with principal component analysis (PCA) [34], exponential smoothing (ES) [25] and the Holt-Winters exponential smoothing method. The best performing method was double seasonal Holt-Winters exponential smoothing.

Regarding the hybrid models, which combine techniques from two different methods to obtain the demand prediction, Fan et al. [23] analysed a SOM Neural Network to cluster each data set into subsets. In addition, 24 SVMs are used to adjust each subset to the next day's load profile. Hybrid models can also combine several forecasting models, where the final forecast is provided by the combination of their outputs. In previous research [24], the authors used AR and NN methods to provide separate forecasts, with the final result given by the linear combination of both; the linear combination was implemented in order to enhance forecasting accuracy.

One issue that has not received as much attention as the forecasting engine is the forecasting of special days. Several days throughout the year show a profile that does not match the expected profile for their weekday. These differences may be caused by temperature [26], but the more extreme cases are caused by socio-economic effects of the calendar. In Figure 1, we can see how the load profile of a special day (1 November, a national holiday in Spain) has lower demand values than any normal day. This and other special days are harder to forecast and incur higher forecasting errors that increase forecasting losses. This is especially important because a low average forecasting error with large peak errors may be costlier than a slightly less accurate forecast with smaller peak errors. This research focuses on how to provide the proper information about these special days so that a forecasting engine may forecast them more accurately.

**Figure 1.** Hourly electricity demand in Spain from 23 October 2017 to 5 November 2017.

There are several ways in which the calendar affects load profiles. Figure 1 already shows that a national holiday has a profile more similar to a Sunday than to a typical weekday. However, this does not mean that all national holidays share the same profile. In addition, the demand pattern of normal days can be altered by the proximity of special days (see Figure 2), such as the Monday before a holiday or the Friday after a holiday. The load profiles of special days do not all follow the same demand pattern: it can be seen in Figure 2 that the load profile of 12 October differs from the load profiles of normal days. In addition, the demand on 13 October is lower than that of a normal Friday because it falls after a holiday. If we consider all special days as belonging to only one kind of day, the accuracy of the demand prediction will be affected.

**Figure 2.** Hourly electricity demand in Spain from 2 October 2017 to 15 October 2017.

Each day of the week may have its own profile and, in many cases, there are interactions causing the type of special day to be conditioned by its weekday. The same special day may have different demand patterns depending on the day of the week. Figure 3 displays the load profile of the special day 7 December for the period 2010 to 2017. Both the day before and the day after 7 December are holidays. However, the demand loads for Monday, Saturday and Sunday are very different from the other days, as are the first hours of Thursday and the period between 9 am and 7 pm on that same day. Special days should therefore be classified into several types depending on the day of the week.

**Figure 3.** Hourly electricity demand for December 7th, which fell on Tuesday (2010), Wednesday (2011), Friday (2012), Saturday (2013), Sunday (2014), Monday (2015), Wednesday (2016) and Thursday (2017).

However, this classification of special days is not enough to obtain all the information that lies within the profiles of the special days. A deeper classification of the special days is necessary to improve forecasting accuracy, which has been the subject of this work. All demand patterns different from the demand patterns of normal days should be analyzed, taking into account the influences on demand described in this paper. Special days with similar demand patterns are grouped to reduce the complexity and computing times of the forecasting model.

The contributions of this research lie in:


This paper is organized as follows: Section 2 describes the state of the art and previous work related to forecasting special days. Section 3 shows the available data for the analysis, the characteristics of the mathematical model used, the treatment to the input variables in order to be processed by the model and the different experiments carried out. Section 4 includes the results of these experiments, analyzing how each classification of days affects modeling accuracy. Section 5 is a brief conclusion expressing the relevance of using complex classification schemes for the different types of days in order to achieve accurate forecasts. Appendices A and B include a more extensive review of the results.

#### **2. State of the Art and Related Work**

The aim of this study is to improve STLF accuracy for special days by defining a more detailed classification. The load profiles of different days do not follow the same pattern, and if we group the most similar demand patterns, the accuracy of the prediction will increase. The demand patterns of special days and those of normal days are very different. If our forecasting model does not differentiate between normal and special days, the results obtained on these special days will be inaccurate.

In most of the research works related to STLF, all days are classified into two or three large groups such as weekdays, weekends and holiday periods [23,35–37]. In some research articles, where special days are classified according to the demand pattern, they are grouped into 5 [38] or 15 [39] different day types, obtaining better results than choosing only three types of days. The load profiles on special days do not share a single demand pattern (e.g., days adjacent to holidays, the Christmas period, Easter, national holidays, the week before Christmas) [26,38]; consequently, the forecasting uncertainty is greater for these days. The day of the week is also an important factor in the load profiles of special days [40–42]: the same special day may have different demand patterns depending on the day of the week. In addition, the demand pattern of normal days can be altered by the proximity of special days [38,40,41,43].

Several works have been published on anomalous load forecasting [26,38–44]. The research in [42] only takes into account the days with the greatest prediction errors. These days are the holidays that fall on a Saturday or a Monday. This research was done in Korea, where Sundays are holidays. Special days are classified into four categories (Tuesday, Wednesday, Thursday and Friday; Saturday; Monday; Sunday). This research only classifies special days depending on the day of the week they fall on. This method reduces the highest prediction errors; however, the accuracy of the prediction can be improved if a deeper classification of special days is performed. In [43], the different types of day are classified based on the shape of the load curve into three categories (weekdays (Monday to Friday); Saturday; Sunday and holidays). Through the application of special rules, the proximity of the forecast day to a holiday is taken into account. This classification does not differentiate special days from each other. In [26], data are classified according to the type of day and month, to capture the effects of seasonality on the load profile. In addition, three variables are added to check the impact on electricity consumption of holidays, days following a holiday and Easter. In [38], special days are classified into five different types of day (weekday, Saturday, Sunday, Monday and special holidays), but a neural network is necessary for each type of special day. In addition, a fuzzy inference model forecasts the maximum and minimum loads of a special day. Therefore, if the number of types of special days increases, the forecasting model becomes more complex. In [39], SOM is used to group the days with similar load profiles and STLF is performed by means of an NN. SOM performs the classification of the different types of days into groups that can vary between 11 and 15. This technique has been discarded because the separation of the different types of days requires prior knowledge that is difficult to assemble and whose result is not clear. The classification proposed in this research is similar to the classification described in [40], where special days are treated according to whether they fall on the same date, on the same day of the week, or on a weekday or weekend. However, the classification of special days must be finer, as must the number of days considered special. In [41], a variation of the forecasting model described in [40] is used, increasing the training period to 8 years and formulating a specific rule to be applied in France. However, the classification of special days into seven categories is still insufficient. In [44], special days are classified into four categories: common holidays (some national holidays and all local and regional holidays) and three special national holidays.

The use of categorical variables to formulate such classifications in linear regression models is very common [26,41,44–46] and it is the same approach used in this paper. These categorical variables define the type of day and are translated into dummy binary categories that allow the regression model to estimate each type of day individually without any linear assumption among normal or special days.
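As a small illustration of this dummy-coding approach (the category names below are invented for illustration, not the paper's actual labels):

```python
# Illustrative dummy coding of day-type categories for a regression model;
# the category names are invented, not the paper's actual labels.
import pandas as pd

days = pd.DataFrame({
    "date": pd.to_datetime(["2017-10-31", "2017-11-01", "2017-11-02"]),
    "day_type": ["day_before_holiday", "national_holiday", "day_after_holiday"],
})

# One binary column per category: the regression estimates each type of day
# individually, with no linear assumption relating the categories.
dummies = pd.get_dummies(days["day_type"])
print(sorted(dummies.columns))
```

Each resulting binary column receives its own coefficient per hourly model, which is what lets every type of day have an unconstrained profile of its own.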

#### **3. Methodology**

The starting point of this study is a load forecasting system currently in use at the Red Eléctrica de España (REE) headquarters. REE is the Transmission System Operator for the Spanish system. The following paragraphs describe the available data, the variables actually introduced in the system and the mathematical aspects of the model. In addition, this section describes the different experiments carried out to determine how the variety of special days can be classified and the type of information used to characterize each type of day.

#### *3.1. Available Data*

The input data can be grouped as load, temperature and calendar data:


#### *3.2. Mathematical Model*

The forecasting model used as a starting point is thoroughly described in [24]. It includes a neural network and an autoregressive model whose outputs are combined to provide a single forecast. The combination of both outputs is more accurate than either of them alone; therefore, both techniques have advantages and are useful as forecasters. However, the black-box characteristics of the neural network make it less useful for extracting conclusions about the model, and it is consequently discarded for this study. In addition, the linearity that regression models impose between each variable and the output can be overcome by linearization methods (e.g., for temperature) or other techniques. The abnormalities that special days cause in the daily profiles are non-linear, because each hour is affected differently and, therefore, the resulting profile is not a scaled copy of the profile of a regular day. In addition, there is no linear relation between the nature of each special day and its effect on the load. The regression model used allows us to include these abnormalities by using individual models for each hour, whose coefficients, therefore, are not restricted in any way, and by using separate binary categories for each type of day. The model then provides specific profiles for each category without any relation between the hours within a profile or between the profiles of different categories.

The autoregressive model is described by Equation (1), where a given output *y<sub>t</sub>* is a linear combination of the model's *p* previous known errors (*e<sub>t−i</sub>*), a number of exogenous variables included in *X<sub>t</sub>* and a random shock *ε<sub>t</sub>*:

$$y\_t = \sum\_{i=1}^p q\_i \cdot e\_{t-i} + X\_t \cdot \theta + \varepsilon\_t \tag{1}$$

This type of process is useful for characterizing time series which are self-correlated to some extent. However, the autoregressive part may also cover up the effect of the exogenous variables. Therefore, in our study, the autoregressive part of the model has been eliminated as shown in Equation (2):

$$y\_t = X\_t \cdot \theta + \varepsilon\_t \tag{2}$$

The effect that any exogenous variable may have on the load may vary throughout the day, which means that the coefficient for each variable may take different values at different times. In order to meet this requirement, the 24 h profile is obtained by using one model for each hour. The input structure for each model is the same. In addition, the output used is the natural logarithm of the load, which experimentally shows a lower modeling error than the actual load.
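The scheme described above, one regression per hour fitted on the logarithm of the load, can be sketched as follows, with synthetic data standing in for the real exogenous variables:

```python
# Sketch of the modeling scheme described above: one linear model per hour,
# fitted on the natural logarithm of the load. All data here are synthetic.
import numpy as np

rng = np.random.default_rng(2)
n_days, n_features = 365, 5
X = rng.random((n_days, n_features))          # exogenous variables per day
true_theta = rng.random((n_features, 24))
load = np.exp(1.0 + X @ true_theta)           # hourly loads, n_days x 24

hourly_models = []
for hour in range(24):
    y = np.log(load[:, hour])                 # output: ln(load) at this hour
    # Ordinary least squares fit of this hour's model (Equation (2))
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    hourly_models.append(theta)

# 24 independent coefficient vectors: each hour's coefficients are unrestricted
print(len(hourly_models), hourly_models[0].shape)
```

Because the 24 fits share no parameters, a variable's effect is free to differ at every hour, which is exactly the flexibility the text argues for.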

#### *3.3. Input for the Model*

The exogenous variables used in the model stem from the available data described above. However, due to the non-linearities present in most load-variable relations, a pretreatment is necessary to conform the definite variables going into the exogenous variables matrix:


• Temperature data: Five of the 59 available series (Madrid, Barcelona, Seville, Bilbao and Zaragoza) are used, representing the different climate areas in Spain. Linearity is achieved by using the Hot and Cold Degree Days (HDD, CDD) method [44]. This technique splits the data series into two different series, each one accounting for demand increases under hot and cold temperatures. It requires defining two thresholds splitting the temperature range into three parts: cold days below the cold threshold, a neutral zone in between thresholds and hot days above the hot threshold. Therefore, this method models the load-temperature relationship as a piecewise linear function that calculates different slopes for cold and hot days while it sets the slope for the neutral zone to zero. Figure 4 illustrates this methodology. In order to model temperature inertia, the model includes the current value and lagged variables from all locations for the last two days. To sum up, the temperature variables include the HDD and CDD series from five locations with the current and two lagged values. This treatment adds up to 30 variables.

• Calendar data: The information about the type of day is critical as load profiles vary greatly from regular Mondays, to Fridays, Sundays, and even among holidays and special periods throughout the year. Therefore, a detailed classification is key to forecasting these special days accurately. In the starting model, 53 variables are used to classify the type of day along with eleven more to assign the month. The variables included in the model from these 53 variables are described in Tables 1–5.

**Figure 4.** Scatter plot of load from the region of Madrid against temperature from the Madrid weather station. The linearization through HDDs and CDDs is shown in green.
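A minimal sketch of the HDD/CDD transformation described above; the threshold values below are assumed for illustration, not the ones used by the authors:

```python
# Sketch of the Hot/Cold Degree Days linearization; the thresholds below are
# assumed for illustration, not the values used by the authors.
import numpy as np

cold_threshold, hot_threshold = 15.0, 22.0   # deg C, assumed thresholds
temperature = np.array([5.0, 12.0, 18.0, 25.0, 31.0])

# CDD captures demand increases on cold days, HDD on hot days; both are zero
# in the neutral zone, giving the piecewise linear shape of Figure 4.
cdd = np.maximum(cold_threshold - temperature, 0.0)
hdd = np.maximum(temperature - hot_threshold, 0.0)
print(cdd.tolist(), hdd.tolist())
```

Feeding `cdd` and `hdd` (rather than raw temperature) into the regression lets the model fit separate slopes for cold and hot days, with a zero slope in between.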

A set of 24 variables is used to identify 24 specific days that are considered to have a profile of their own, incompatible with any other day. These cases are described in Table 1:


**Table 1.** National special days.

<sup>1</sup> Weekend instances may qualify into other categories.

The three variables used to classify the rest of national holidays from the B.O.E. are described in Table 2:



Days adjacent to a holiday or special day may show a different load profile depending on the day of the week. Table 3 shows the four variables used to model this phenomenon.



<sup>1</sup> Typically, the profile is more affected if the day is in between the weekend and the holiday.

The classification of regular days is done through six variables for the days Monday to Saturday. The variables from Tables 1–3 describe categories defined as exclusive: a day belonging to any of these categories may not belong to any other. However, the following categories are conceived as modifiers of the day of the week and may be active at the same time as the day-of-the-week variables.

Religious holidays are widely observed in Spain, and the work and school calendar includes two periods besides the summer in which people concentrate their vacation days. The effect of Easter was included in the exclusive variables because, by definition, it falls on the same weekday every year; however, the period around Christmas presents a complexity that forces the use of the modifiers shown in Table 4:



<sup>1</sup> Days from the week before seem to be differently affected among them while the days in other periods are all equally affected within their own period.

The summer period is affected by a lower demand from industry due to holidays but a higher demand from services due to tourism. This fluctuation is not constant throughout the summer but may vary weekly. The inclusion of the month variables helps model this behavior but, in the case of August, three additional variables have been added to model the differences among weeks.

Finally, regional or local holidays are published in regional gazettes and identified by a variable which, in this case, is not binary but equivalent to the fraction of the National Gross Product that the particular region represents.

The effect of Daylight Saving Time is also considered in the initial forecasting model in two ways. Firstly, it relies on the autoregressive part to phase out the error from the time shifts in March and October; secondly, in order to ease the transition, the first three days of each season are considered special. These special days are not considered in this study because the autoregressive part has been removed. To sum up, Table 5 summarizes all the type-of-day variables.


**Table 5.** Types of day category summary.

In summary, the model includes one long-term load variable, 30 temperature variables, 11 binary variables for the month and 53 binary variables for the type of day. The output variable is the natural logarithm of the load.

#### *3.4. Experiments and Results*

The aim of this study is to determine the different load profiles that a given day may have depending on social and labor-related characteristics. To determine whether the classification above is adequate, simplistic or overly complex, this study proposes a series of tests to determine how the accuracy of the model varies as the complexity of the classification increases. These experiments are based on eight different classifications, starting from the most simplistic, in which only the day of the week is observed, and finishing with the most complex, in which all the categories described above are considered. The different models are incrementally defined in Table 6, in which each incremental change is described.

Special days are scarce and certain types may only happen every two or three years. This causes a problem when splitting the data set into training data and out-of-sample testing data. In order to solve this, all experiments have been carried out using each one of the 8 years as the testing period and the other 7 as training data. Therefore, the results provided for each model correspond to an 8-year period for which every year is obtained as an out-of-sample test of the model trained with the other 7 years.
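The rotating out-of-sample scheme described above can be sketched as follows (the year range is an illustrative 8-year period):

```python
# Sketch of the rotating out-of-sample scheme described above: each of the
# 8 years is tested once against a model trained on the other 7 years.
years = list(range(2010, 2018))          # illustrative 8-year period

folds = []
for test_year in years:
    train_years = [y for y in years if y != test_year]
    folds.append((train_years, test_year))

# Every year appears exactly once as the out-of-sample test period
print(len(folds), len(folds[0][0]))
```

This is the standard leave-one-group-out design: rare day types that occur only every two or three years still end up in the test set of some fold.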


#### **Table 6.** Summary of incrementally complex models.

Each model provides a different output which differs especially on days for which other models use a different classification. It is obvious that when a model includes a new category to better suit a type of day, the days falling under said category are expected to be more accurately modeled. However, it is worth mentioning that by "cleaning up" the category in which these days were before, the remaining days in the category also experience a change in their profile that should be for the best. Therefore, even among models that apparently have the same definition of a category, i.e., all models have a regular Monday category, there may be differences in these categories among these models.

The reported error measure is the Mean Absolute Percentage Error (MAPE), and it is categorized by type of day to focus on the specific changes among models. For each general category of day types, the error is reported for each model that introduces a significant change in the definition of said category. Models that treat a category in the same way and that may only experience collateral changes are not reported, for clarity reasons.
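The per-category error computation can be sketched as follows (load values and labels are placeholders, not results from the paper):

```python
# Per-category MAPE computation as used to report errors by type of day;
# the load values and labels below are placeholders.
import numpy as np

actual = np.array([100.0, 110.0, 95.0, 210.0])
forecast = np.array([102.0, 107.0, 99.0, 200.0])
day_type = np.array(["regular", "regular", "holiday", "holiday"])

mape_by_category = {}
for category in np.unique(day_type):
    mask = day_type == category
    mape = 100 * np.mean(np.abs(actual[mask] - forecast[mask]) / actual[mask])
    mape_by_category[category] = mape

print({c: round(m, 2) for c, m in mape_by_category.items()})
```

Grouping the errors by day type in this way is what isolates the accuracy change that each refinement of the classification brings.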

In addition to the accuracy of a given classification, it is important to understand how each classification models the load profile. This will help us in determining not only which classification is more accurate but also, and more importantly, how each special day's profile is different from each other and learn the expected load from each of them.

Since the output of each model is not the actual load but the natural logarithm, the effect of each coefficient on the load can be considered not as an addend but as a multiplier to the expected load. Considering the way that the type of day categories are defined, the base profile is that of a regular January Sunday. Therefore, a category coefficient higher than 1 means that the typical load at that particular hour for that particular category is larger than the load expected for a regular January Sunday controlling for other factors like temperature.
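The multiplicative reading of the coefficients follows from the log transform: if the regression is fit on log(load), an additive dummy coefficient β acts on the load itself as the multiplier e^β. The sketch below illustrates this on synthetic data (all numbers are made up, not the paper's estimates): a category whose load is 20% below the base day recovers a multiplier close to 0.8.

```python
import numpy as np

# If the model is fit on log(load), an additive dummy coefficient beta acts on
# the load itself as a multiplier exp(beta). Toy illustration (synthetic data,
# not the paper's model): a category whose load is 20% below the base day.
rng = np.random.default_rng(0)
n = 1000
is_special = rng.integers(0, 2, n)       # 1 if the day is in the category
base_load = 22000.0                      # reference load (e.g., a regular Sunday)
load = base_load * np.where(is_special, 0.8, 1.0) * rng.lognormal(0.0, 0.01, n)

# Ordinary least squares on the log: log(load) = a + beta * is_special
X = np.column_stack([np.ones(n), is_special])
coef, *_ = np.linalg.lstsq(X, np.log(load), rcond=None)
multiplier = np.exp(coef[1])             # the reported "coefficient profile" value
```

A multiplier above 1 at a given hour thus means a load above the regular January Sunday reference, controlling for the other regressors.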

The coefficient profile calculated for each category by each model is the second result of this study and it provides a useful tool for understanding the nature of each type of special day and the behavioral changes of the consumer on such types of day. These profiles are included in the results section if they are relevant but all of them are also included in Appendix A for the reader's reference.

#### **4. Results**

The following section describes the results obtained by each model for every category of type of day. These results include the MAPE results for each model to assess the accuracy improvement that the refining of each category adds to the model. In addition to the error, it also includes the coefficient profile so that the accuracy change can be interpreted based on how the category is differently modeled.

#### *4.1. Regular Days*

Although regular days are not the focus of this study, analyzing this category helps in understanding the rest of the results. The definition of the category remains unchanged through all eight models; however, more and more special days are removed from it from model 1 to model 7, so the modeling of regular days becomes increasingly accurate. Figure 5 shows the changes in MAPE for each day of the week across all models. The largest improvement occurs from model 0 to model 1, the first model to acknowledge the existence of special days. Nevertheless, accuracy keeps improving through model 7, especially on Sundays, Mondays and Saturdays.

**Figure 5.** MAPE error of all seven models for the seven days of the week.

The change across models can be seen in Figure 6a, which shows the load profile and the reported MAPE of each model for regular Mondays. The graph shows that the model 0 profile is lower than the other two, which are essentially equivalent. This is because model 0 models regular Mondays after all Mondays in the data set, while models 4 and 7 exclude most or all special days, which have a lower profile. The results for the remaining regular days are shown in Appendix A (Figure A1). Figure 6b shows the model 7 profiles for all days of the week: Monday has a lower start, Friday a lower finish, Saturday a unique profile, and Tuesday, Wednesday and Thursday are very similar.

**Figure 6.** (**a**) Coefficient profile and error chart for regular Mondays.; (**b**) Coefficient profiles for all regular days in model 7. Sunday is a straight line because it is the reference.

It is also important to quantify whether two profiles are similar enough to be considered the same type of day. To do so, Table 7 presents, for each profile, the maximum difference with respect to its most similar pair, i.e., the maximum percentage difference between the two 24 h profiles; this measures whether two profiles may be equivalent and represent the same type of day. Both type-of-day categories are named at the top and bottom of the table. In this case, there is less than 1% difference among Tuesday, Wednesday and Thursday, so these three categories are candidates for a joint category.
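The similarity measure used in the tables can be computed directly from two 24-hour profiles. A minimal sketch (the profiles below are made up):

```python
import numpy as np

# Maximum percentage difference between two 24-hour coefficient profiles, used
# above to decide whether two day types could share a category (a sketch; the
# profiles here are made up).
def max_pct_difference(profile_a, profile_b):
    a, b = np.asarray(profile_a), np.asarray(profile_b)
    return 100.0 * np.max(np.abs(a - b) / b)

tuesday   = np.full(24, 1.00)
wednesday = np.full(24, 1.005)   # at most 0.5% away from Tuesday at every hour
max_diff = max_pct_difference(wednesday, tuesday)
```

Pairs of categories whose maximum difference falls below a tolerance (around 1% in the text) are candidates for merging.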

**Table 7.** Differences between days of the week.


#### *4.2. Special Days Type 1 and Type 2*

Special day types 1 and 2 comprise 11 fixed dates (see Table 1) that are considered to have specific load profiles. The models that introduce differences into these categories are 0 (all days are regular days), 3 (includes holidays and the days before and after), 5 (distinguishes profiles for different days of the week) and 7 (each day is considered unique). Figure 7a shows the accuracy of the category under these four models. Treating these days as generic holidays or before/after-holiday days lowers the error from 15.6% to around 7%, while considering each individual holiday yields a much better result of 2.8%. For these special days, assigning different profiles according to the day of the week (model 3 vs. 5) does not improve the result.

In addition, Figure 7b shows the example of Christmas Day and how the coefficient profile for this particular day changes in each model. Interestingly, Christmas Day shows a much lower load profile than most other special days, and so it needs to be considered separately. The coefficient profile and error graphs for the rest of the days within the category are included in Appendix A (Figure A2). The maximum difference between the most similar profiles is detailed in Table 8, which shows that all profiles are independent.

**Figure 7.** (**a**) MAPE for models 0, 3, 5 and 7 for all types 1 and 2 days. (**b**) Coefficient profile and MAPE for models 0, 3, 5, and 7 for Christmas Day.


**Table 8.** Differences between special days type 1 and 2.

<sup>1</sup> The most similar profiles belong to 24 Dec and 30 Dec, but the peak difference is almost 3.5%.

#### *4.3. Special Days Type 3*

Special days type 3 comprise the Easter period. Each day is considered to have its own profile but, unlike types 1 and 2, these days do not fall on the same date each year. The models that affect this type of day are 0, 3, 5 and 7, and the overall accuracy for these days is shown in Figure 8.

**Figure 8.** MAPE for models 0, 3, 5 and 7 for all type 3 days.

This type of special day includes some days whose profile is deeply affected, like Good Friday, and others, like the Wednesday after Easter Monday, that are only slightly different from a regular day. Figure 9 shows the coefficient and error graphs for these two examples, while the rest are included in Appendix A (Figure A3). For clarity, days that happen in the same week as Good Friday are from now on referred to as *A*, while those that happen in the following week are referred to as *B*.

**Figure 9.** (**a**) MAPE for models 0, 3, 5 and 7 for Good Friday. (**b**) Coefficient profile and MAPE for models 0, 3, 5, and 7 for the Wednesday after Good Friday.

While Figure 9a shows that accuracy on Good Friday increases as the classification becomes more complex, the same does not happen for the following Wednesday, as shown in Figure 9b. This might suggest that such a Wednesday should not be considered special, since model 0 yields the most accurate result. However, as mentioned above, the profiles for regular weekdays in model 0 are lower than those of actual regular days, due to the presence of holidays that this model treats as regular. Had this Wednesday been considered a regular day in model 7, its MAPE would rise to 3.3%. Figure 10 shows these profiles and illustrates how the profile for that specific Wednesday is lower than the regular Wednesday profiles of both models 0 and 7. This justifies the use of these specific categories.

**Figure 10.** The profile for Wednesday B is lower than any of the regular day profiles calculated in models 0 or 7.

Table 9 shows the similarities among the coefficient profiles within the category. To illustrate that days like Wednesday B actually differ from regular days, the profiles for regular days from model 7 are included as candidates for most similar. The results show that Friday and Saturday from week B are most similar to regular Fridays and Saturdays, but the difference is larger than two percent.


**Table 9.** Differences between special days type 3.

#### *4.4. Regular Holidays and Days Before and After*

Regular holidays are those considered to have a generic holiday profile. The days before and after comprise days that precede or follow a special day, not necessarily a regular holiday. The models that affect these days are 0, 3, 5 and 7. The overall accuracy for these dates and models is shown in Figure 11. In this case it is clear that taking the day of the week into account improves the accuracy (from model 3 to model 5), and that removing the previously studied special days (some of which were holidays, others days before and after) also improves the generic categories (from model 5 to 7).

**Figure 11.** MAPE for models 0, 3, 5 and 7 for generic holidays and days before or after a holiday.

The coefficient profile and error graphs for all days in these categories are included in Appendix A (Figure A4). In the case of holidays, the category is split into holidays falling on weekends, on Mondays, and during the rest of the week. Figure 12 shows the coefficient profiles for these categories in the final model.

**Figure 12.** The different profiles that a generic holiday may have depend on the day of the week.

The difference between them can be explained by considering the next and previous days. Holidays whose previous day was a weekday start with a higher profile and holidays whose next day is a holiday finish with a lower profile than a regular Sunday.

A similar phenomenon occurs for the days before and after a holiday; their coefficient profiles are shown in Figure 13a,b. In both cases, if the day is adjacent to both the holiday and a weekend, its profile is lower and closer to a holiday's profile. However, Figure 13a shows that if the day before a holiday falls on a Tuesday to Friday, it resembles a Friday more than the corresponding weekday. Similarly, Figure 13b shows that if the day after a holiday falls on a Monday to Thursday, its profile resembles a Monday more than its corresponding weekday profile. Nevertheless, in both cases the profile is slightly lower than that of a regular day. Table 10 shows the similarities that justify the use of these variables.

**Figure 13.** (**a**) Coefficient profiles for days before a holiday with regular Mondays and Wednesday as references. (**b**) Coefficient profiles for days after a holiday with regular Fridays and Wednesday as references.

**Table 10.** Differences between regular holidays and days before and after a holiday.


#### *4.5. Christmas Periods*

Three different periods are included in this category as described in Table 5. The models that affect the classification of these periods are 0, 1, 4 and 7. Model 1 only affects the last period because it considers Jan 2nd and 5th as special days, as described in Table 6. The overall accuracy for all the periods is shown in Figure 14. The error increases from model 0 to 1 because both consider all these days as regular, but model 0 assigns a lower profile because it includes more real holidays. Model 4 introduces one category for each period and model 7 assigns one category for each of the four days of the first period.

**Figure 14.** MAPE for models 0, 1, 4 and 7 for all days included in the three Christmas periods.

The first period (20–23 Dec) is defined by one variable for each day. These variables are modifiers, meaning that the corresponding day-of-week variable remains active alongside them; the coefficient profile of a modifier therefore expresses how a regular day is modified when the variable is active. Figure 15 shows the coefficient profiles associated with these four modifiers. It shows how, especially after noon, every day differs from a regular day (which would equal a straight line with a value of 1). In addition, the effect of Christmas grows gradually: each day closer to Christmas has a lower profile than the one before.
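In the multiplicative reading of the log-linear model, a modifier's multiplier simply stacks on top of the day-of-week multiplier. A toy sketch (all numbers below are illustrative, not the paper's estimates):

```python
import math

# Sketch of how a "modifier" variable combines with the day-of-week category in
# a log-linear model: both coefficients are active, so their multipliers stack.
# All numbers below are illustrative, not the paper's estimates.
log_base      = math.log(22000.0)  # reference: regular January Sunday, this hour
beta_weekday  = math.log(1.15)     # regular-Tuesday multiplier at this hour
beta_dec_22   = math.log(0.93)     # hypothetical "22 Dec" modifier at this hour

expected_load = math.exp(log_base + beta_weekday + beta_dec_22)
# equivalently: 22000.0 * 1.15 * 0.93
```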

**Figure 15.** Coefficient profiles for the modifying variables of 20–23 Dec. Each day closer to Christmas day lowers the demand.

The other two periods are defined by only one variable each. Figure 16 shows the coefficient profiles and errors for these two periods. Figure 16a shows that model 1 yields a worse result because regular days in model 1 have a higher, more realistic profile than in model 0; nevertheless, the even lower profiles from models 4 and 7 provide a much better model. Figure 16b shows that the last period is affected by the definition of special days in model 1. The lower error in this period from model 0 to model 1 is not obtained through a different general profile (the graph shows very similar profiles) but because two days in this period are affected by another variable.

**Figure 16.** (**a**) Coefficient profile and MAPE for models 0, 1, 4, and 7 for the Christmas period 27–29 Dec. (**b**) Coefficient profile and MAPE for models 0, 1, 4, and 7 for the Christmas period 2–5 Jan.

To determine whether any of these profiles can be joined together, or if any of them is sufficiently similar to a regular day, Table 11 shows the most similar profiles.



The coefficient profile and error graphs for all variables are included in Appendix A (Figure A5).

#### *4.6. Summer Vacation*

The last period is the summer vacation during the month of August. As mentioned above, because vacations are usually assigned in full weeks, each week may have a specific profile. The results for models 0, 2, 4 and 7 are shown in Figure 17. Model 2 introduces the possibility of distinguishing between the first three weeks and the fourth, while model 4 allows distinguishing among all four weeks. Model 7 uses the same classification but benefits from improvements in the definition of other special days (probably related to the national holiday of Aug 15th).

**Figure 17.** MAPE for models 0, 2, 4 and 7 for all days included in the summer vacation periods.

The different profiles for each week are shown in Figure 18. Weeks 2 and 3 show a profile around 4% lower than weeks 1 and 4, indicating that vacations are concentrated in the two middle weeks more than in the first and last. The coefficient profile and error graphs are included in Appendix A (Figure A6).

**Figure 18.** Coefficient profiles for days within the summer vacation period. Central weeks in August experience a lower demand than weeks 1 and 4.

#### *4.7. Overall Result*

The overall results for regular and special days are shown in Figure 19. The boxplots show the 5th, 15th, 50th, 85th and 95th percentiles of the model errors for regular and special days from model 0 to model 7.

**Figure 19.** (**a**) Boxplot of the distribution of the error on regular days (**b**) Boxplot of the distribution of the error on special days.

The main conclusions that can be drawn from this are that the proposed special day classification yields an average error for special days of 1.84%, which is very near to the average error for regular days (1.78%). In addition, the 95th percentile for special days drops from 17.6% in model 0 to only 4.56% in model 7. This means that only 5% of special days have a modeling error larger than 4.56%. This number is even lower for regular days (4.33%). These results, shown in Table 12, suggest that the
proposed classification is valid for modeling the special days present in the Spanish system almost as accurately as regular ones.


**Table 12.** Percentiles of the error distribution for regular and special days.
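The percentile figures quoted above can be obtained directly from the per-day error series. A sketch with synthetic errors (not the paper's data):

```python
import numpy as np

# The boxplot statistics reported above (5th/15th/50th/85th/95th percentiles of
# daily MAPE) can be obtained directly from the per-day error series; the error
# values below are synthetic, drawn from a skewed distribution.
rng = np.random.default_rng(1)
daily_mape = rng.gamma(shape=2.0, scale=1.0, size=365)   # synthetic daily errors (%)
p5, p15, p50, p85, p95 = np.percentile(daily_mape, [5, 15, 50, 85, 95])
```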

The extensive categorization presented may be exported to other electrical systems although not all categories may prove to be relevant and some others may be needed. Categories that are not relevant can be identified by assessing the difference between their profile and those of other categories. If two categories present similar profiles, then they may be joined together. However, in order to define new variables, it would be necessary to study the error profiles of the least accurately modeled days and search for a similar pattern in days that can be jointly described in their own category.

#### **5. Conclusions**

Load forecasting is a key activity in any electric system, and a lack of accuracy leads to increased operating costs. These costs grow exponentially with the error, which makes days whose load is hard to anticipate particularly expensive. These special days are those on which working or social habits differ from the ordinary, such as holidays, vacation periods or days adjacent to them. Correctly modeling consumer behavior on these special dates is key to reducing the maximum errors of a forecasting system. This paper has tested the validity of a special day classification system with more than 40 variables. This large number of variables may seem excessive, as most reported models use far fewer categories. However, the comparison of this system against simpler versions of itself shows that, in order to accurately model the extensive variety of effects that the calendar has on consumer behavior, a complex classification system like the one tested in this research is necessary. The methodology described can be transferred to other electrical systems with some adjustments to the category definitions, and it is reasonable to expect that it would prove useful at least in similar systems such as those of France, Portugal or Italy.

**Author Contributions:** M.L. conceived and designed the experiments; C.S. and M.L. performed the experiments; M.L. analyzed the data; M.L., C.S. and S.V. wrote the paper.

**Funding:** This research received financial support from "Subvenciones para Grupos de Investigación Consolidables AICO/2018/102", from Consellería de Educación, Investigación, Cultura y Deporte de la GVA (Generalitat Valenciana), Dirección General de Universidad, Investigación y Ciencia.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

**Figure A1.** Regular days.


**Figure A2.** Special days type 1 and 2.


**Figure A3.** Special days type 3.

**Figure A4.** Regular holidays and days before and after.


**Figure A5.** Christmas periods.

**Figure A6.** Summer vacation.

#### **Appendix B**

All calculated coefficients from the final model for each variable are detailed here:



Categories: (M) Regular Mondays, (T) Regular Tuesdays, (W) Regular Wednesdays, (Th) Regular Thursdays, (F) Regular Fridays, (Sa) Regular Saturdays.

**Table A2.** Special days types 1 and 2.




Categories: Easter Sunday, (SB) Saturday after Easter Monday.

**Table A4.** Generic holidays and days before or after a holiday.


Categories: (HW) Holiday on Weekend, (HM) Holiday on Monday, (HR) Holiday on Tuesday to Friday, (BM) Before a holiday and Monday, (BR) Before a holiday and Tuesday to Friday, (AR) After a holiday and Monday to Thursday, (AF) After a holiday and Friday.



**Table A6.** Summer vacation.


Categories: (W1) 1st week of August, (W2) 2nd week of August, (W3) 3rd week of August.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **HousEEC: Day-Ahead Household Electrical Energy Consumption Forecasting Using Deep Learning**

**Ivana Kiprijanovska 1,2,**†**, Simon Stankoski 1,2,**†**, Igor Ilievski 3, Slobodan Jovanovski 3, Matjaž Gams 1,2 and Hristijan Gjoreski 4,\***


Received: 31 March 2020; Accepted: 20 May 2020; Published: 25 May 2020

**Abstract:** Short-term load forecasting is integral to the energy planning sector. Various techniques have been employed to achieve effective operation of power systems and efficient market management. We present a scalable system for day-ahead household electrical energy consumption forecasting, named HousEEC. The proposed forecasting method is based on a deep residual neural network, and integrates multiple sources of information by extracting features from (i) contextual data (weather, calendar), and (ii) the historical load of the particular household and all households present in the dataset. Additionally, we compute novel domain-specific time-series features that allow the system to better model the pattern of energy consumption of the household. The experimental analysis and evaluation were performed on one of the most extensive datasets for household electrical energy consumption, Pecan Street, containing almost four years of data. Multiple test cases show that the proposed model provides accurate load forecasting results, achieving a root-mean-square error score of 0.44 kWh and mean absolute error score of 0.23 kWh, for short-term load forecasting for 300 households. The analysis showed that, for hourly forecasting, our model had 8% error (22 kWh), which is 4 percentage points better than the benchmark model. The daily analysis showed that our model had 2% error (131 kWh), which is significantly less compared to the benchmark model, with 6% error (360 kWh).

**Keywords:** short-term load forecasting; day ahead; feature extraction; deep residual neural network; multiple sources; electricity

#### **1. Introduction**

Electrical energy (EE) is one of the most significant driving forces of economic development, and is considered essential to daily life. Although EE is a clean form of energy when it is used, the production and transmission of electricity can have a negative effect on the environment. Additionally, overproduction of EE is problematic, because storing excess electricity is challenging and difficult even with today's technological advances. Hence, a system that can accurately predict EE consumption can be used for electricity production planning, and significantly reduce the problems with storage and overproduction.

In recent years, with the introduction of deregulation and liberalization of the energy markets, EE consumption forecasting has become even more relevant. An accurate short-term load forecasting (STLF) system can play a crucial role in effective power system operation and efficient market management. Such a system has multiple benefits: (i) it can optimize the production process, thus reducing the cost of overproduction and improving equipment utilization; (ii) it is eco-friendly, with fewer resources used to produce electricity; (iii) it can help in optimizing power grid load and strengthening reliability; (iv) it can potentially decrease EE consumption costs for households by better planning the production/buying of EE in advance; and (v) it emphasizes EE trading possibilities.

The massive development of smart grid technologies in the residential sector brings many challenges to the load forecasting community. It allows EE consumption to be obtained in close to real time, and allows extraction of valuable data that both the supply and demand side can use for efficient management of the electricity load network.

In recent years, there have been various data-driven approaches for modeling and forecasting EE consumption. Most of them focus on industrial objects, factories, and companies, and some are more focused on households. Furthermore, some focus on short-term forecasts (hourly, daily) with a small prediction horizon (an hour in advance), and some focus on long-term forecasts (weekly, monthly). The studies that focus on STLF with a large prediction horizon (at least one day ahead) are quite limited. Therefore, in this paper, we present the household electrical energy consumption (HousEEC) forecast system, which provides day-ahead household electrical energy (EE) consumption forecasts, using a deep residual neural network (DRNN) that combines multiple sources of information. The key contributions of the paper are as follows:


#### **2. Related Work**

Selecting a forecasting method depends on multiple factors, including the availability and relevance of historical data, the desired prediction accuracy, the forecast horizon, and so forth. In recent years, the STLF problem has been tackled with various methods, each characterized by different advantages and disadvantages in terms of training complexity, prediction accuracy, limitations in the forecasting horizon, etc. In general, the related work in STLF can be divided according to the type of user (industrial entities or households) and the method used (e.g., statistical, ML, DL).

#### *2.1. Related Methods*

With the advent of statistical software packages and artificial intelligence techniques, numerous methods have been proposed to model future EE consumption and improve forecasting performance. These methods can be divided into two categories: conventional statistical methods and methods based on artificial intelligence (AI).

Statistical methods provide explicit mathematical models in which the load is represented as a function of several input factors. These were the first methods used and for years represented the benchmark among STLF systems. All of these methods, which include smoothing techniques, data extrapolation and curve fitting, assume that the load data have an internal structure. Autoregressive moving average (ARMA) models were among the first used in STLF [1–3]. They were soon replaced by autoregressive integrated moving average (ARIMA) models [4] and seasonal ARIMA models [5] to deal with the time variance often exhibited by load consumption profiles. Other examples of statistical methods used in STLF are multiple regression [6], exponential smoothing [7], adaptive load forecasting [8,9] and Kalman filtering [10,11]. The major weakness of these approaches is their assumption of the linearity of the observed system. EE forecasting is a complex multivariable and multidimensional estimation problem, and these methods are not always suitable for finding the nonlinear relationship between the independent influencing variables and the EE consumption.
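The autoregressive component of the ARMA family can be illustrated in a few lines: fit load_t = c + φ·load_{t−1} by least squares and forecast one step ahead. This is only a didactic AR(1) sketch on synthetic data, not the cited implementations (which include moving-average terms, differencing and seasonality).

```python
import numpy as np

# Minimal autoregressive sketch: fit an AR(1) model, load_t = c + phi*load_{t-1},
# by least squares, and forecast one step ahead. This only illustrates the AR
# part of the ARMA family cited above, on synthetic data.
rng = np.random.default_rng(2)
n, phi_true, c_true = 500, 0.9, 10.0
y = np.empty(n)
y[0] = c_true / (1 - phi_true)                 # start at the process mean
for t in range(1, n):
    y[t] = c_true + phi_true * y[t - 1] + rng.normal(0.0, 1.0)

X = np.column_stack([np.ones(n - 1), y[:-1]])  # regress y_t on [1, y_{t-1}]
(c_hat, phi_hat), *_ = np.linalg.lstsq(X, y[1:], rcond=None)
next_forecast = c_hat + phi_hat * y[-1]        # one-step-ahead forecast
```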

On the contrary, advanced ML methods are suitable for finding patterns and regularities in the data and use them to forecast future EE consumption. ML based methods have shown great performance in the field of STLF. The most commonly used ML algorithms for STLF are support vector machines (SVM) [12,13], random forest [14,15] and artificial neural networks (ANNs) [16]. However, as shown in numerous studies and in the benchmark Global Energy Forecasting Competition 2012 (GEFCom2012) [17], very often, simple ML methods applied to manually crafted complex features (polynomial and exponential interaction features combining multiple variables) achieve better and more robust performance [18]. These features often use the lagged and recency effect, first introduced in [19]. One of the winning teams [20] at GEFCom2012 used lagged hourly and average daily temperature variables in the competition. They applied a gradient boosting algorithm to learn the dependencies between features and target variables. Another winning team at GEFCom2012 [21] used exponentially smoothed temperature variables. They used generalized additive models and kernel regression for long-term load and medium-term forecasting, and random forests for short-term load forecasting.
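The "lagged and recency" features mentioned above can be sketched as a design-matrix construction: hourly lags of a driver series (temperature here) plus a moving average of the recent hours. This is a generic illustration on a synthetic series, not the competitors' actual feature sets; any downstream learner (gradient boosting in [20], or a plain linear model) would consume the same matrix.

```python
import numpy as np

# Sketch of "lagged and recency" feature construction: temperature lags plus a
# moving average of recent hours, assembled into a design matrix. Synthetic data.
def lag_features(temp, lags=(1, 2, 24), window=24):
    """Rows start where every lag and the moving-average window are defined."""
    start = max(max(lags), window)
    cols = [temp[start - lag: len(temp) - lag] for lag in lags]       # lagged values
    recency = np.array([temp[t - window: t].mean()                    # recent average
                        for t in range(start, len(temp))])
    return np.column_stack(cols + [recency])

temp = np.arange(100, dtype=float)   # stand-in hourly temperature series
X = lag_features(temp)               # one row per forecastable hour
```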

Over the past few years, DL has been a subject of intense study in many fields, especially time-series prediction. Deep neural networks (DNNs) have shown the capability to approximate any complex function with arbitrary precision. In [22], the authors showed that some DNN architectures are able to outperform classical ML approaches in the load forecasting task. The authors of [23] proposed a convolutional neural network (CNN) as an effective and accurate approach for household-level load forecasting. They showed that a CNN is able to capture short-term trends in load data and that a data-augmentation technique can improve the load forecasting accuracy. Compared with conventional feedforward neural networks, recurrent neural networks (RNNs) have the particular advantage of coping with historical data through a feedback connection. In [24], the authors presented a deep RNN to predict electricity consumption for commercial and residential buildings. As an extension of RNNs, long short-term memory (LSTM) networks have been used in the load forecasting field in the last few years [25]. The authors of [26] utilized two types of LSTM networks (standard and encoder-decoder architecture) to make predictions for one household. The authors of [27] proposed an enhanced LSTM for EE consumption forecasting of a metropolitan power system in France. Their method takes into account the periodicity of load consumption by using multiple sequences of input time lags, and achieves higher performance than a single-sequence LSTM. Moreover, different hybrid architectures have been explored in order to avoid the limitations of individual models. A hybrid approach for STLF is presented in [28], where the authors processed the load signal in parallel with an LSTM and a CNN. The features generated by the two networks were then used as input to a fully connected network in charge of forecasting the day-ahead load. The authors of [29] proposed a hybrid model which combines a general regression neural network (GRNN), the minimal redundancy maximal relevance technique and empirical mode decomposition. The efficiency of the model is validated on aggregated load data from a power system in China, where it shows higher forecasting accuracy than a single GRNN and SVM. In [30], a hybrid method is proposed which combines LSTM, empirical mode decomposition and similar-days selection to build a prediction architecture for short-term load forecasting. The authors concluded that the robustness of the individual methods in the hybrid scheme can be an advantage for the forecasting model.

#### *2.2. Related Studies According to User Type*

According to the type of user, EE consumption forecasting approaches can be divided into those that focus on industrial entities (industrial consumption) and those that focus on households (residential consumption). The industrial approaches focus on entities such as factories, enterprises and companies, and have substantial commercial potential because industry consumes significant amounts of EE. STLF for industrial entities in Spain is discussed in [31]. The authors presented a neuro-fuzzy system with a backpropagation learning algorithm and compared the results achieved with those of other techniques, such as multilayer perceptrons and statistical ARIMA processes. In [32], the authors present a model for STLF for a hospital in China. They combined LSTM and CNN and explored the network performance by considering the coupling of electrical loads, gas and heating. The authors of [33] introduced an ARMA model for load forecasting of industrial companies, with a focus on EE consumption profiles in which stochastic regime changes can be observed. In [34], a set of multiple linear regression models is developed for modeling industrial loads. The data used in the study were collected from an Italian factory, and the authors showed how a few qualitative variables characterize the production schedule. In [35], the authors develop different models for forecasting the next-hour load using data from a Spanish industrial pole. With an optimized model for single-hour prediction, a hybrid strategy was applied to build a complete day-ahead hourly load forecasting model. In general, the studies related to industrial EE consumption provide more accurate models compared to households, probably because industrial entities have strict regulations (i.e., shifts and working time), which makes forecasting less challenging.

On the other hand, residential EE consumption is more challenging to forecast. Each household has its own pattern and electricity consumption profile, which are determined by the number of occupants, their lifestyle, the household area, electrical appliances present in the household, etc. Additionally, household-level EE consumption can vary considerably from one day to the next due to work schedules, holidays, weather conditions, etc. Therefore, most of the approaches in this field tend to avoid such uncertainty by using load aggregation: they focus on forecasting EE consumption of clusters of households, usually grouped by location (i.e., buildings and neighborhoods). Load aggregation usually reduces the inherent variability in load consumption, which results in smoother load shapes that are more predictable. This effect is illustrated in Figure 1.

In [36], the authors used a clustering method to divide households into different types. For each cluster, a neural network is fitted, and the forecasts are summed to form predictions for the aggregated load. The authors demonstrate that clustering significantly increases forecast accuracy. Similarly, in [37], the authors propose a three-step process consisting of clustering approaches, load forecasting for each cluster, and aggregating the forecasts to obtain results at a system level. The authors of [38] also show that aggregating more households improves the relative forecasting performance. They compare load forecasting accuracy at various levels of aggregation for many forecasting methods. In [39], the load consumption forecasting problem is addressed using random forest and support vector regression (SVR). Predictions are made on three spatial scales, and the obtained results show that the combination of K = 32 clusters and random forest yields the highest forecasting accuracy.

The systems that focus on neighborhoods lose vital information about each household; thus, they have lower commercial value, i.e., such systems cannot monitor and learn the behavior of individual households. Therefore, they cannot offer personalization and planning of EE consumption, which would be useful for cost reduction. There are just a handful of recent studies covering short-term load forecasts (e.g., day-ahead, hourly) for individual households, since they are still very challenging. The authors of [40] present a pooling-based deep recurrent neural network (PDRNN), which batches groups of customer EE consumption profiles into a pool of inputs. The authors of [41] applied Kalman filtering to single-household data for a sampling period and forecast horizon of one hour. In [42], an approach is proposed to model the load of individual households based on daily schedule pattern analysis and context information.

**Figure 1.** Weekly electrical energy (EE) consumption of: (**a**) 1 household, (**b**) 15 households, (**c**) 50 households, (**d**) 100 households.

However, the authors focus on predicting consumption with a prediction horizon shorter than one day, which does not have the same economic value as one-day-ahead hourly forecasts. Typically, the results of a day-ahead forecast are used as a baseline for planning the 24 h period of the next day, while forecasts with a horizon shorter than one day (intraday forecasts) are mostly used for adjustment of day-ahead purchases [43]. An accurate day-ahead forecast minimizes the possibility of overproduction and underproduction, and satisfies load requirements in a more economical way, thus reducing the total operation costs [44].

Our proposed solution for EE consumption forecasting includes short-term forecasting (day-ahead forecast, for each hour of the day separately) for household consumption, which has significant economic and industrial value. In our study, we focus on STLF of individual households, which we believe is very specific and challenging due to the variability in consumption and randomness of households.

#### **3. Dataset**

#### *3.1. Pecan Street Dataset*

In order to develop a model that can accurately and reliably forecast EE consumption, we performed a thorough analysis of the existing datasets. We analyzed most of the datasets in this domain and then selected the Pecan Street dataset as the most appropriate one for our study. An extensive analysis of other relevant datasets and their characteristics can be found in Appendix A.

The Pecan Street dataset is one of the richest datasets related to residential EE consumption. It consists of EE consumption data, obtained from approximately 1000 households in the USA, mainly Austin, Texas. The dataset contains the actual EE consumption values from each household in one-minute intervals, collected by eGauge devices [45]. Our analysis is based on hourly household EE consumption, given in kilowatt-hours (kWh). Descriptive statistics of the EE consumption are provided in Table 1.

**Table 1.** Descriptive statistics of EE consumption.


Figure 2 shows the average daily EE consumption, i.e., each line in the figure represents average EE consumption for one day in the dataset. Each line is obtained by averaging the load consumption values for each hour in the day separately. The dashed line represents the mean EE consumption at hourly intervals.

**Figure 2.** Average daily EE consumption.

Additionally, the Pecan Street dataset contains extensive weather data for the observed region. STLF is strongly influenced by weather parameters, because heating, ventilation and air-conditioning (HVAC) are highly dependent on outdoor temperature, humidity, wind speed, etc. Figure 3 shows a two-dimensional heatmap of EE consumption. The heatmap represents average hourly consumption in the corresponding time intervals with predefined colors, where warmer colors represent higher consumption. Figure 3 shows that there is a noticeable increase in average electricity consumption in the summer months. This is specific to this dataset, i.e., it was collected in Texas, USA, where summer temperatures are very high and there is increased use of air-conditioning. Therefore, the steady increase in EE consumption during the summer months can be attributed to the use of air-conditioners.

The data used in this study were collected from 925 households for a period of almost four years (2015, 2016, 2017, and nine months of 2018). In order to accurately evaluate the proposed forecasting model's performance, we divided the data into three parts: (i) 27 months were used for training data (6 even-numbered months of 2015, all of 2016, and the first 9 months of 2017). (ii) Six months (odd-numbered months) from 2015 were taken for validation data. (iii) The last 12 months were chosen for test data (last 3 months of 2017 and 9 months of 2018).
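The month-based split described above can be sketched as follows. This is a minimal illustration assuming the load data live in a pandas DataFrame with an hourly `DatetimeIndex`; the function name is our own, not from the paper.

```python
import pandas as pd

def split_by_period(df: pd.DataFrame):
    """Split hourly load data into the train/validation/test periods
    described in the text: 27 months for training, the 6 odd-numbered
    months of 2015 for validation, and the last 12 months for testing."""
    idx = df.index
    train = df[((idx.year == 2015) & (idx.month % 2 == 0)) |
               (idx.year == 2016) |
               ((idx.year == 2017) & (idx.month <= 9))]
    val = df[(idx.year == 2015) & (idx.month % 2 == 1)]
    test = df[((idx.year == 2017) & (idx.month >= 10)) |
              (idx.year == 2018)]
    return train, val, test
```

Splitting by calendar period rather than randomly keeps whole seasons together, so the test set probes the model on genuinely unseen months.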

#### *3.2. Dataset Preprocessing*

One of the most important steps towards developing an accurate ML model is data preprocessing. This process prepares the data for analysis by correcting or removing data that are incorrect, incomplete, irrelevant, duplicated or improperly formatted. The preprocessing of the dataset included the following steps:


#### **4. Methodology**

In the day-ahead electricity market, generation companies and retailers submit supply and demand orders for every hour of the following day. Therefore, the focus of our work was to create a model that can forecast electricity consumption one day ahead, at 10:00, for every hour of the following day (shown in Figure 4) [46].

**Figure 4.** Day-ahead forecast timeline.

This timeline allows planning of the production for the following day in accordance with the day-ahead electricity market. According to this timeline, we developed two models that make predictions for different hours of the next day: one for the hours from midnight to 09:00, and one for the rest of the day. The main reason for developing two models is that we want to include the 24-h-before load consumption value for the hours from midnight to 09:00; at the time the predictions are made (10:00), this value is available only for these hours. We considered this valuable additional information that can improve forecasting for the first nine hours of the following day, because the periodic nature of EE consumption makes the most recent EE consumption values the dominant factor in STLF [47].
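The routing between the two models can be expressed as a one-line dispatch. A trivial sketch under our own naming; the paper does not prescribe this interface.

```python
def select_model(forecast_hour: int, model_early, model_late):
    """Route a forecast to one of the two models described above:
    hours 00-09 go to the model that also receives the 24-h-before
    load value (available at the 10:00 prediction time); the remaining
    hours go to the second model."""
    return model_early if 0 <= forecast_hour <= 9 else model_late
```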

#### *4.1. Feature Engineering*

EE load forecasting is a complex multivariable and multidimensional estimation problem. The impacts of many influencing factors that affect load consumption need to be studied in order to develop a precise load forecasting model. Thus, we extracted several features from multiple sources, which can mainly be grouped into two categories: contextual and historical load features.

#### 4.1.1. Contextual Features

• Weather features

The weather is a crucial driving factor in EE consumption. That is why it is a common EE consumption forecasting practice to include weather variables, such as wind speed, humidity and precipitation intensity, in forecasting models. The factor that has the most influence on EE consumption is temperature. Several weather-related features were extracted, and the main focus was on the temperature-related features.

• Calendar features

The social element is part of the reason for the hourly, daily and weekly patterns in EE consumption [48]. To allow forecast models to take into account the EE consumption variations which are tied to days, times of the day and seasons, we included some calendar data as nominal features. We also included information about the special days according to the area of interest, Austin, Texas.

• Interaction features

We also used interaction features, i.e., combinations of two existing features [49]. The hours in the days of a week may result in different loads due to human activities. For instance, there may be a smaller load on weekend mornings than weekday mornings, because people usually do not get up as early as when they have to go to work. This results in lower EE consumption values. The implementation of this group of features was simply done by multiplying two features.

For a full table with all extracted contextual features, see Appendix B.
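The interaction-feature construction described above (multiplying two existing features) can be sketched in a few lines. The column names here are illustrative, not the ones used in the paper.

```python
import pandas as pd

def add_interaction_features(df: pd.DataFrame) -> pd.DataFrame:
    """Create an interaction feature as the product of two existing
    features. Multiplying the hour of the day by a weekend flag lets
    a model distinguish weekend mornings from weekday mornings."""
    out = df.copy()
    out["hour_x_weekend"] = out["hour"] * out["is_weekend"]
    return out
```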

#### 4.1.2. Historical Load Features

Load consumption is highly related to historical load, due to its periodic nature. Thus, in this study, historical loads of up to one week were used to predict the day-ahead hourly load.

• Standard features

Due to the strong daily patterns of EE consumption, it is highly correlated to consumption at the same hour of previous days [50–52]. That is why the following lagged values were used in the training process of the forecasting model:


Features load<sub>t-24h</sub>, load<sub>t-25h</sub> and load<sub>t-26h</sub> are used only for the first model, for the hours from midnight to 09:00 (see Section 4).
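Lagged load features of this kind are straightforward to build with pandas. A sketch assuming an hourly-indexed series; the paper's full lag list is not reproduced here, so the lags below (daily multiples up to one week, plus the 24-26 h lags for the first interval) are an illustrative reading of the text.

```python
import pandas as pd

def add_lagged_loads(df: pd.DataFrame, first_interval: bool) -> pd.DataFrame:
    """Add lagged hourly load features. The 24-26 h lags are included
    only for the model covering midnight to 09:00, since only those
    hours have the 24-h-before value available at 10:00."""
    out = df.copy()
    lags = [48, 72, 96, 120, 144, 168]   # same hour, 2-7 days before
    if first_interval:
        lags = [24, 25, 26] + lags
    for lag in lags:
        out[f"load_t-{lag}h"] = out["load"].shift(lag)
    return out
```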

• Domain-specific historical load features

Based on the fact that future EE consumption is highly related to historical load, we additionally analyzed four types of time series. The first two take into account the strong daily pattern of EE consumption, and consist of historical load data from the day previous to the day when the predictions are made (all 24 h): one refers to the average load consumption in each hour, calculated from all households present in the system, and the other refers to the load consumption of each household in the same hours. The other two types of time series take into account the significance of the lagged values of EE consumption related to the same hours of previous days. More specifically, one of these time series consists of average values for load consumption (from all households) from hour 24, 48, 72, 96, 120, 144 and 168 prior to the forecasted hour, and the other consists of load consumption of each household in the same hours. As mentioned before, the 24-h-before EE consumption value is only used for instances referring to the first defined interval (midnight to 09:00). It should be noted that in the previous section, the lagged values of EE consumption were used as actual features, but in this section they are used for constructing time series from which additional time and frequency features will be extracted.

To include valuable characteristics about the manner of EE consumption in the feature vector, for each instance we generated a comprehensive set of features based on these four types of time series. The features were extracted using the TSFRESH (https://tsfresh.readthedocs.io/en/latest/text/list\_of\_features.html) Python package, which offers extraction of time- and frequency-domain features from time series. We generated 400 new features for each instance. These features include minimum, maximum, variance, correlation, covariance, skewness, kurtosis, number of times the signal is above/below its mean, signal mean change, its autocorrelations (correlations for different delays), etc. These new features give new insight into time-series dynamics, and we believe that they can be significant in improving forecast accuracy. Figure 5 shows how the four time series are constructed for a forecast for a particular household at 08:00.
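To give a feel for the statistics involved, a handful of the time-series descriptors named above can be computed directly with numpy. This is a minimal sketch, not the TSFRESH package itself, and it covers only six of the roughly 400 features the paper extracts.

```python
import numpy as np

def basic_ts_features(series):
    """Compute a few of the time-domain statistics listed above:
    minimum, maximum, variance, mean change, count above mean,
    and lag-1 autocorrelation of a load time series."""
    x = np.asarray(series, dtype=float)
    mean = x.mean()
    feats = {
        "minimum": x.min(),
        "maximum": x.max(),
        "variance": x.var(),
        "mean_change": np.diff(x).mean(),
        "count_above_mean": int((x > mean).sum()),
    }
    # lag-1 autocorrelation: correlation of the signal with itself shifted by one step
    x0, x1 = x[:-1] - mean, x[1:] - mean
    feats["autocorr_lag1"] = float((x0 * x1).sum() / ((x - mean) ** 2).sum())
    return feats
```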

**Figure 5.** Domain-specific historical load feature extraction procedure.

#### *4.2. Deep Residual Neural Network*

DL is part of ML, and is based on artificial neural network architecture [53]. DL allows models comprised of numerous processing layers to learn data representations with multiple levels of abstraction. DL architectures have been applied to many fields, where they have produced results comparable, or in some cases superior, to those of human experts.

One type of DNN that was recently proposed is the deep residual neural network (DRNN). This type of deep network has performed extremely well on natural language processing tasks [54,55] and has emerged as a state-of-the-art architecture in computer vision, image segmentation and object detection [56,57]. More recently, architectural variants of DRNN have also been used in load forecasting, where they have shown improvement in aggregated load forecast compared to conventional regression models [58,59]. Therefore, in this work we further explore the effectiveness of the DRNN architecture in day-ahead load forecasting for single households. A DRNN can easily be constructed by stacking several residual blocks (Figure 6a). In the residual block, a mapping from *x* to Θ is learned, where Θ is a set of weights related to the residual block. Accordingly, the general representation of the residual block can be written as shown in Equation (1):

$$H(\mathbf{x}) = F(\mathbf{x}, \; \Theta) + \mathbf{x}. \tag{1}$$

**Figure 6.** (**a**) Deep residual neural network (DRNN) structure; (**b**) original residual block; (**c**) pre-activation variant of residual block.

The forward propagation of the structure, where *K* residual blocks are stacked, can be represented as shown in Equation (2):

$$\mathbf{x}\_K = \mathbf{x}\_0 + \sum\_{i=1}^{K} F(\mathbf{x}\_{i-1}, \ \Theta\_{i-1}) \tag{2}$$

where *x*<sub>0</sub> and *x*<sub>K</sub> are the input and the output of the residual network, respectively, and Θ<sub>*i*</sub> = {Θ<sub>*i*,*l*</sub> | 1 ≤ *l* ≤ *L*} is the set of weights related to the *i*th residual block, *L* being the number of layers within the block. Basically, the shortcut *x* has no parameters and only adds the output from the previous layer to the layer ahead. The original structure of a residual block used for building a DRNN is shown in Figure 6b.

As DRNNs gain more and more popularity in the research community, their architecture is being studied more intensely. There are many proposed interpretations of the DRNN architecture and variants of residual blocks. For our DRNN architecture, we used a pre-activation variant of the residual block, proposed in [60]. In this residual block, the rectified linear unit (ReLU) activation function and batch normalization (BN) are used as pre-activation of the weight layers, in contrast to the conventional approach of post-activation. The residual block used for building the DRNN architecture is shown in Figure 6c. In our case, instead of using convolutional layers as weight layers within the block, we used dense layers, making the network more applicable for feature-based input.
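The forward pass of a pre-activation residual block with dense weight layers, and of a stack of such blocks, can be sketched in numpy. This is a simplified illustration of Equations (1) and (2): batch normalization is omitted for brevity, and the weights are plain matrices rather than trained layers.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def preactivation_residual_block(x, W1, b1, W2, b2):
    """H(x) = F(x, Theta) + x (Equation (1)), pre-activation variant:
    the activation is applied *before* each dense weight layer,
    and the identity shortcut is added parameter-free at the end."""
    h = relu(x) @ W1 + b1      # first dense layer, pre-activated
    h = relu(h) @ W2 + b2      # second dense layer, pre-activated
    return h + x               # identity shortcut

def residual_network(x, blocks):
    """Stack K residual blocks (Equation (2)): the output equals the
    input plus the accumulated residual mappings F(x_{i-1}, Theta_{i-1})."""
    for W1, b1, W2, b2 in blocks:
        x = preactivation_residual_block(x, W1, b1, W2, b2)
    return x
```

With all weights at zero, the residual mapping F vanishes and the block reduces to the identity, which is exactly the property that makes very deep residual stacks easy to optimize.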

#### *4.3. Proposed Architecture for Household Electrical Energy Consumption Forecast (HousEEC)*

In this section, we present our proposed architecture for STLF, which is based on a deep residual neural network. First, we collect daily EE consumption, weather and calendar data. Weather and calendar data are used for extracting contextual features (see Section 4.1.1). From the daily EE consumption data, we extract standard historical load-related features referring to a particular household, or average values for all households in the system. Additionally, we define four time series (see Section 4.1.2) to extract domain-specific historical load features. The values of the extracted contextual and load-related features are then transformed so that their distribution is centered around 0 (mean value 0) with a standard deviation of 1. This is done feature-wise, i.e., independently for each feature.
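The feature-wise standardization described above can be written as a small helper. A minimal sketch, assuming (as is standard practice) that the mean and standard deviation are computed on the training set only and reused for validation and test data.

```python
import numpy as np

def standardize(X_train, X_other):
    """Center each feature (column) to mean 0 and scale to standard
    deviation 1, using statistics from the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0          # guard against constant features
    return (X_train - mu) / sigma, (X_other - mu) / sigma
```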

The structure of the DRNN for load forecasting is illustrated in Figure 7. The input features are separated into two groups, and each group is used as input in a separate branch. One branch uses contextual features in combination with the classical historical load features as input, and the other uses only domain-specific features. The left branch starts with a residual block containing 32 neurons in the fully connected layers, while the right branch starts with a residual block containing 64 neurons in the fully connected layers. The use of fully connected layers instead of the original convolutional layers in the residual blocks makes the network more applicable for feature-based input and regression [61]. The output of the first two branches is then concatenated with the raw input features, and as such is fed to a DRNN with five additional residual blocks. Each residual block consists of two fully connected layers, an activation function and batch normalization. The fully connected layers in the blocks consist of 64, 32, 16, 16 and 8 neurons, consecutively. All such layers in the residual blocks use ReLU as the activation function. Mathematically, it is defined as f(x) = max(0, x), which makes it suitable for the STLF problem, since the forecasted consumption cannot have negative values. Additionally, we used a dropout rate of 0.1 in order to reduce the chances of overfitting. A total of 6 levels of residual blocks are stacked (1 input level with 2 residual blocks and an additional 5 levels after the concatenation block), forming a 12-layer DRNN.
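The two-branch topology described above can be sketched structurally in numpy. This is only a shape-level illustration of the dataflow (branches, concatenation with raw features, five further residual blocks of 64, 32, 16, 16 and 8 neurons): the weights are random placeholders, and batch normalization and dropout are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """Randomly initialized dense layer: (weights, bias)."""
    return rng.normal(0, 0.1, (n_in, n_out)), np.zeros(n_out)

def residual_dense(x, n):
    """Simplified residual block with two n-neuron dense layers;
    a projection on the shortcut matches widths when they differ."""
    (W1, b1), (W2, b2) = dense(x.shape[1], n), dense(n, n)
    h = np.maximum(0, x @ W1 + b1)
    h = np.maximum(0, h @ W2 + b2)
    if x.shape[1] != n:
        Wp, bp = dense(x.shape[1], n)
        x = x @ Wp + bp
    return h + x

def houseec_forward(ctx_hist, domain):
    """Structural sketch of the architecture in Figure 7: a 32-neuron
    branch for contextual + standard load features, a 64-neuron branch
    for domain-specific features, concatenation with the raw inputs,
    five further residual blocks, and a single-output regression head."""
    left = residual_dense(ctx_hist, 32)
    right = residual_dense(domain, 64)
    x = np.concatenate([left, right, ctx_hist, domain], axis=1)
    for n in (64, 32, 16, 16, 8):
        x = residual_dense(x, n)
    Wo, bo = dense(x.shape[1], 1)
    return x @ Wo + bo
```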

**Figure 7.** Proposed deep residual neural network (DRNN) architecture for short-term load forecasting (STLF).

#### **5. Experimental Setup**

#### *5.1. Evaluation Metrics*

In order to evaluate and compare the models, several evaluation metrics were used: root-mean-square error (RMSE) [62], mean absolute error (MAE) [63], and the R<sup>2</sup> score [64], which are well-known metrics used to measure performance on regression tasks.

MAE and RMSE are directly interpretable in terms of the measurement unit used (kWh in our case). RMSE is a measure that shows how spread out the residuals are. Residuals are the differences between actual and predicted values. The definition of RMSE indicates that large errors have higher weight. Since in our regression problem the forecasted values are in a small range, large errors are particularly undesirable. Since we want to penalize large errors more, we focused more on RMSE. MAE shows how close the forecasted values are to the actual values. It is calculated as the mean of the absolute values of the prediction errors on all instances of the test dataset. R<sup>2</sup> expresses how well the model replicates the observed outcomes, based on the proportion of total variation of outcomes explained by the model. This metric is positively oriented, and its highest value is 1. The RMSE, MAE and R<sup>2</sup> scores are calculated as shown in Equations (3)–(5):

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum\_{i=1}^{n} \left( y\_{predicted} - y\_{true} \right)^2} \tag{3}$$

$$\text{MAE} = \frac{1}{n} \sum\_{i=1}^{n} |y\_{predicted} - y\_{true}| \tag{4}$$

$$\text{R}^2 = 1 - \frac{\sum\_{i=1}^{n} \left( y\_{\text{predicted}} - y\_{\text{true}} \right)^2}{\sum\_{i=1}^{n} \left( y\_{\text{true}} - y\_{\text{average}} \right)^2},\tag{5}$$

where *n* is the number of data samples.
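Equations (3)–(5) translate directly into numpy. A minimal sketch; equivalent functions also exist in libraries such as scikit-learn.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Equation (3): root-mean-square error."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mae(y_true, y_pred):
    """Equation (4): mean absolute error."""
    return float(np.mean(np.abs(y_pred - y_true)))

def r2(y_true, y_pred):
    """Equation (5): proportion of variance explained by the model."""
    ss_res = np.sum((y_pred - y_true) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)
```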

#### *5.2. Reference Models*

A reference model or benchmark uses simple summary statistics to create predictions. These predictions are used to measure the benchmark performance, and this result becomes what we compare our ML model results against. For this study, we implemented three baseline models. One model provides the amount of EE consumed by a specific user 24 or 48 h before the hour of prediction. The 24-h-before value is used for prediction of instances in the first interval, midnight to 09:00, and the 48-h-before value is used for prediction of instances in the second interval, 10:00 to 23:00. Another baseline model is the Vanilla multiple regression benchmark model [19]. This model uses multiple sources of data to predict future load; in particular, polynomials of temperature and their interactions with calendar variables. To enhance the accuracy of STLF, we augmented the Vanilla multiple regression model by adding some lagged load variables, as well as other combinations of variables that enhance the interaction effect. The last benchmark model is the seasonal autoregressive integrated moving average (SARIMA) [65].

For more detailed explanation of the reference models, see Appendix C.

#### **6. Experimental Results**

To explore the performance of our proposed model in EE consumption forecast, we did a series of experiments. Sections 6.1–6.5 present numerous comparisons of results for disaggregated hourly load forecast, and Section 6.6 presents the efficiency of the proposed method in aggregated load forecast. Section 6.7 presents a general model to overcome the cold start issue.

#### *6.1. Comparison of Forecasting Techniques*

To verify the predictive performance of our STLF model, we made comparisons with the previously mentioned benchmarks (see Section 5.2), as well as other ML algorithms: linear regression [66], K-nearest neighbors (KNN) [67], decision tree regressor [68], random forest [69], linear SVR [70], gradient boosting [71,72] and xgboost [73] (see Appendix D). We also considered a classic DRNN, comprising five residual blocks, that takes all features together as input for the first residual block.

Table 2 shows the RMSE, MAE and R<sup>2</sup> scores for each model and the benchmarks. A comparison of the performance of the models using different sets of features was also conducted. In the first scenario, only contextual features and standard historical load features were used as input. In the second scenario, the proposed domain-specific historical load features were also included. From the results, the benefit of including domain-specific historical load features can be seen. In almost all cases, the proposed domain-specific historical load features significantly improved the model performance. In addition, the results show that our proposed input structure of the DRNN significantly improves the forecasting accuracy. Our proposed model outperformed all other models in both scenarios, achieving an RMSE of 0.44 kWh, MAE of 0.23 kWh and R<sup>2</sup> score of 0.90.



Computation time for model training and testing is important for practical implementation in a system, where models are retrained with new data and daily predictions are made. The training and testing times of the models used in the experiments are shown in Table 3. In all, 2,544,962 instances were used for training and 1,654,499 for testing the models. The deep learning models were trained and tested on an NVIDIA Titan X GPU, with 12 GB of GDDR5X memory and a memory bandwidth of 480 GB/s, while the conventional ML models were trained and tested on an AMD Ryzen 7 2700 CPU with 8 cores and a maximum clock frequency of 4.1 GHz.

**Table 3.** Computation time for model training and testing.


#### *6.2. Error Analysis of Application Scenarios*

• Hourly forecast

Figure 8 shows the RMSE score for each hour of the day. The results are obtained by averaging the errors of all users for each hour. Larger errors can be observed at 03:00, at 23:00, and in the afternoon hours, when most people return from work and perform different activities at home.

**Figure 8.** Root-mean-square error (RMSE) for each hour of the day.

However, our model reports quite low error for the morning hours, which is significant because morning hours are related to increased EE consumption, especially on workdays. Overall, there is no significant difference in the reported error for any specific part of the day. Our model significantly outperforms the benchmark model for each hour of the day.

• Weekday forecast

Figure 9 shows the RMSE score for each day of the week. The results are obtained by averaging the errors of all users for each day. The benchmark makes a larger error for weekend days; they are more challenging to forecast due to vacations, trips and irregularities in people's lives. However, our model performs similarly for each day of the week, regardless of the uncertainties that are usually present on weekend days.

**Figure 9.** Root-mean-square error (RMSE) for each day of the week.

• Monthly forecast

Figure 10 shows the RMSE score for each month of the year. The results are obtained by averaging the errors of all users for each month. Both the benchmark and our proposed method follow a similar trend in terms of the prediction error; the RMSE score is lowest for the spring months, when there is no need for heating or cooling. The largest error made by our model can be observed for May, when the cooling season starts. However, after some time, the increasing trend of EE consumption is incorporated into the extracted features, so the prediction errors start decreasing. This is a very important characteristic of our model, since the rest of the summer months are also characterized by increased EE consumption. This is certainly not the case with the benchmark model, which reports the largest error when EE consumption is at its peak.

**Figure 10.** Root-mean-square error (RMSE) for each month of the year.

#### *6.3. Comparison with Other Deep Learning Approaches That Use Only Time Series*

We additionally made performance comparisons between our method and the most recent DL architectures relevant to load forecasting, described in [74]. The authors present seven architectures designed for 24 h prediction and evaluate them using the individual household electric power consumption (IHEPC) [75] dataset, which contains 47 months of EE consumption data of a single household. Based on their results, we chose the five best architectures and evaluated them on the Pecan Street dataset: a classical feedforward neural network (FFNN), deep residual neural network (DRNN), temporal convolutional network (TCN), long short-term memory (LSTM) and gated recurrent unit (GRU). This DRNN uses different residual blocks compared to the one in our proposed model. All the mentioned networks are described in detail in that paper. For this evaluation, we used seven week-long time series as input for the networks: two related to the historical load and five related to weather data. The first time series is the actual EE consumption of a specific household in the past week, and the second is the average load consumption of all households in the past week. The weather-related time series are temperature, humidity, apparent temperature, wind speed and precipitation. The Pecan Street dataset contains weather and load measurements for each hour, resulting in inputs of 168 hourly measurements and outputs of 24 hourly measurements for the networks. Since the results in the paper showed that including calendar information improves prediction accuracy, we additionally included the following information: hour of the day, day of the week, month and work-/non-workday. For training, we used the multiple input–multiple output (MIMO) strategy, meaning that a single predictor is trained to forecast the whole 24-value output sequence in a single shot.
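The MIMO strategy can be illustrated with a deliberately simple linear stand-in: one predictor maps the full 168-value input window directly to all 24 hourly outputs in a single shot, instead of forecasting one step at a time and feeding predictions back recursively. This is only a sketch of the training strategy, not one of the DL architectures compared here.

```python
import numpy as np

def fit_linear_mimo(X, Y):
    """Least-squares linear MIMO forecaster: a single weight matrix
    maps each 168-value input window to the whole 24-value output."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append bias column
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return W

def predict_mimo(W, X):
    """Forecast all 24 hours at once: shape (n_samples, 24)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ W
```

Because every output hour is produced jointly from the same window, MIMO avoids the error accumulation that recursive one-step forecasting suffers over a 24 h horizon.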

Table 4 shows the results: HousEEC shows better results in terms of RMSE, MAE and R<sup>2</sup> compared to end-to-end DL-based methods for load forecasting on the household level. The main conclusion that can be drawn from these results is that the time-series consisting of 168 historical load values does not contain enough information for proper training of DL end-to-end architectures. However, one-week historical load appears to be enough for proper training of the feature-based DRNN, especially when it is trained with extensive feature sets consisting of the domain-specific features which give new insights into the load time-series dynamics.


**Table 4.** Performance of end-to-end deep learning (DL) approaches.

#### *6.4. State-of-the-Art STLF on Household Level*

The STLF field lacks a unified comparison between conducted studies. There are many studies in this field that address different segments related to load forecasting, and most of them are not directly comparable. Nevertheless, we believe that a summary of the results achieved with state-of-the-art methods might be informative and useful for new studies in a few ways. Authors can select the most commonly used dataset for their work in order to produce comparable results, and it can help researchers to avoid selecting nonrepresentative data for evaluation of their methods. In this section, selected studies relevant to STLF on the household level are presented. The two criteria for study selection were the forecasting horizon (up to 24 h) and the evaluation metric (RMSE). In order to include more relevant studies, we additionally considered studies reporting normalized root mean squared error (NRMSE), calculated as shown in Equation (6):

$$\text{NRMSE} = \frac{\sqrt{\frac{1}{n}\sum\_{i=1}^{n} \left(y\_{predicted} - y\_{true}\right)^2}}{\frac{1}{n}\sum\_{i=1}^{n} y\_{true}}.\tag{6}$$
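Equation (6) is a one-liner in numpy; a minimal sketch for completeness.

```python
import numpy as np

def nrmse(y_true, y_pred):
    """Equation (6): RMSE normalized by the mean of the true values,
    making results comparable across households of different scales."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    return float(rmse / y_true.mean())
```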

We ended up with 12 relevant studies, including ours. Table 5 presents a summary of the studies in terms of forecasting horizon, number of households used for evaluation, duration of the test data, and results achieved in terms of RMSE (NRMSE). One parameter that should be considered in this comparison is the size of the data used for evaluation. EE consumption is highly affected by the weather; a lot of electricity is used for cooling in summer and heating in winter. This leads to the conclusion that studies that use shorter periods for their evaluation might present unreliable results without checking model performance in different seasons. Only one of the selected studies evaluated its methods using data collected over a period of 12 months. In order to show how robust the proposed methods are, more households are needed for evaluation. This is because there are different types of users, such as elderly people who spend most of their time at home, people who go to work, students who have a dynamic lifestyle, etc. Only four studies considered datasets with fewer than 100 households.

**Table 5.** Summary of state-of-the-art STLF studies.


For our work, we addressed the previously mentioned challenges; our model provides forecasting one day ahead, and the results are evaluated on 297 households over a period of one year. We believe that our results are very promising, considering that they show great performance of the model for a large number of households evaluated for a period of 12 months.

#### *6.5. Analysis of Different Lengths of the Training Set*

Over time, new households with different EE consumption patterns can appear in the forecasting system. Therefore, it is a common practice for forecasting models to be trained with new data after a certain time. This section presents the HousEEC model's performance for three subsets of the initial test set, when additional data is used for training. For comparison, we used the final HousEEC model (trained with 27 months) to predict the EE consumption of the three new test subsets. Table 6 shows the RMSE, MAE and R<sup>2</sup> scores for different train-test splits.


**Table 6.** Performance of models with different train-test splits.

Even though the constant inclusion of new data is expected to expand the knowledge of the existing model, this analysis showed no significant benefit when no new households appear in the dataset.

#### *6.6. Aggregated Consumption Forecasting*

Forecasting of aggregated load can be implemented by the standard strategy of directly forecasting the aggregated load, or by aggregating the forecasts for individual households. In Figure 11, we compare the curve of aggregated forecasts for all households to the actual aggregated load curve. The observed period is the first week of July, one of the hottest months of the year and the month with the highest EE consumption in our dataset, driven by air-conditioning (Figure 3). Since peak EE consumption is the most challenging to forecast, we examined the model's performance for this week more closely. The forecast successfully follows the trend of actual consumption, even on July 5, when a significant drop in EE consumption occurred that is not typical for the observed period.

**Figure 11.** Weekly aggregated EE consumption.

Finally, we calculated the total consumption of all households and the total error for hourly and daily analyses. The hourly analysis showed that, on average, the aggregated EE consumption of the households is 263 kWh. Our model makes an 8% error on average per hourly forecast (22 kWh), while the Vanilla multiple regression benchmark makes a 12% error (32 kWh). The daily analysis showed that, on average, the aggregated EE consumption of all households is 6140 kWh. Our model makes a 2% error on average per daily forecast (131 kWh); the baseline makes a 6% error (360 kWh).
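The bottom-up strategy described above, summing individual household forecasts and scoring the aggregate, can be sketched as follows; the function and its list-of-lists inputs are hypothetical illustrations, not part of the HousEEC implementation:

```python
def aggregate_error(actual_by_household, forecast_by_household):
    """Sum per-household hourly series into aggregate series, then return the
    mean absolute percentage error (as a fraction) of the aggregated forecast.

    Each argument is a list of households, each a list of hourly kWh values.
    """
    agg_actual = [sum(vals) for vals in zip(*actual_by_household)]
    agg_forecast = [sum(vals) for vals in zip(*forecast_by_household)]
    errs = [abs(f - a) / a for a, f in zip(agg_actual, agg_forecast)]
    return sum(errs) / len(errs)
```

Because individual household errors partially cancel when summed, the aggregate error is typically lower than the average per-household error.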

#### *6.7. Cold Start Issue*

In order to predict the next-day EE consumption of a new household in the system, the HousEEC model requires the household's historical EE consumption from the previous week. This means it suffers from a one-week cold start, which is a technical limitation of the model. To overcome this limitation, we trained an additional general model that does not use the household-specific standard and domain-specific historical load features extracted from the third type of time series (see Figure 5). This model serves to predict household EE consumption for the first week only. We used the same HousEEC architecture; the only difference was the number of input features. The performance of this model on the same test set used in the previous experiments is shown in Table 7. As expected, the general model provided less precise results than the final HousEEC model. However, we consider these results acceptable, given that the general model would be used for only a short period in an actual implementation of the system. The presented results are also additional evidence of the significance of domain-specific historical load features for a particular household.

**Table 7.** Comparison of performance of general and HousEEC models.


#### **7. HousEEC System Prototype**

This section presents the practical implementation of the HousEEC system and deployment of the ML model in a prototype web application. The system enables end users to quickly and easily access a service that allows different analyses. One of the most important features of this system is that it can be easily implemented in larger systems that have different monitoring devices for electricity consumption in households. The only prerequisite for implementing such a system for analyzing and forecasting electricity consumption is access to the measured values of household EE consumption. The system contains three main modules:


A visualization of the system and its households is shown in Figure 12. For better visualization, households that are geographically very close are presented as a group (blue circles on the map). Note that this is a simulation: the geographic locations are for illustration purposes only, as the dataset does not contain location information about the households. Next, the application provides access to a table of measured EE consumption of all households for the last 24 h. In addition, there is an option to search for a specific household, which provides insight into its individual EE consumption time series. This table also enables easy checking of the accuracy of household measuring devices: whether they report values at all and whether there are erroneous values in the metered data (negative values for consumed electricity or values outside the expected range). If unexpected data are spotted in the table, they can be deleted from the database, preventing them from affecting the ML model's predictions in the following days.

**Figure 12.** Map of simulated households' locations.

The next service of the system provides daily predictions of EE consumption for each hour of the next day. These forecasts are obtained by executing the ML model at 10:00 every day, which allows sufficient time for planning actions on the day-ahead electricity market, which, as mentioned before, closes at 12:00. Although forecasts are obtained at the household level, they are presented at the aggregated level for all households. Figure 13 shows three lines, representing predicted electricity consumption achieved by the three chosen models: random forest, the benchmark, and our final HousEEC model.

**Figure 13.** Aggregated hourly predictions of EE consumption for the next day.

The final service offered by the system is the performance analysis of the ML models (Figure 14). With this service, users can load predictions for the past period and compare them to actual consumption values. First, the user selects the interval of interest and the models. Then the system outputs the predictions and true consumption. For example, Figure 14 shows predictions of the random forest and HousEEC models and the true consumption for the randomly selected period from 1–15 June 2018. The user can visually inspect the model errors.

**Figure 14.** Historical performance comparison of predicted to true EE consumption.

When using real-time data collection devices, it is inevitable that some amount of data gets lost due to different circumstances (sensor fault, communication error, environmental disturbance, etc.). In this context, techniques that deal with missing data are a crucial part of implementing a forecasting system. To guarantee that our forecasting system works smoothly, we considered two cases of missing data and appropriate techniques to handle each. The first case is a missing EE consumption value for a single hour for a particular household. For this case, we implemented linear interpolation, which estimates the missing value from the neighboring measured values. The second case is when sensor readings are missing for two or more consecutive hours for a particular household. In this case, the missing values are replaced with the values forecasted for those hours by the HousEEC model (or by the general model, if the missing values occur in the first week of the data collection process for the household; see Section 6.7).
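A minimal sketch of the two-case gap-filling strategy, assuming hourly series as Python lists with `None` marking missing readings; the `forecast` argument stands in for the HousEEC/general-model predictions and is an illustrative assumption:

```python
def fill_missing(series, forecast):
    """Fill gaps in an hourly load series.

    Single missing hour with measured neighbors -> linear interpolation;
    two or more consecutive missing hours (or a gap at a boundary) ->
    substitute the model's forecasted values for those hours.
    """
    filled = list(series)
    i = 0
    while i < len(filled):
        if filled[i] is None:
            j = i
            while j < len(filled) and filled[j] is None:
                j += 1  # find the end of the gap
            if j - i == 1 and 0 < i and j < len(filled):
                # one missing hour: interpolate between the two neighbors
                filled[i] = (filled[i - 1] + filled[j]) / 2
            else:
                # longer gap: fall back to the forecasted values
                for k in range(i, j):
                    filled[k] = forecast[k]
            i = j
        else:
            i += 1
    return filled
```

This keeps the database free of holes so the next day's prediction always has a complete input window.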

#### **8. Conclusions**

The paper presents the HousEEC system, which provides day-ahead household EE consumption forecasting using a deep residual neural network. The experimental evaluation was performed on one of the richest datasets for household EE consumption, the Pecan Street dataset. The DL approach combines multiple sources of information by extracting features from (i) contextual data (e.g., weather, calendar) and (ii) the historical load of the particular household and of all households present in the dataset. Additionally, we computed novel domain-specific time-series features that allow the system to better model the pattern of household energy consumption; their contribution to reducing the error is shown in Table 2. Finally, we assessed performance by comparing the results achieved with our model against seven other ML algorithms, five DL architectures, and two benchmarks widely used in this area.

The experimental results show that our model outperformed every other algorithm and approach in all cases, achieving an RMSE of 0.44 kWh, MAE of 0.23 kWh and R<sup>2</sup> score of 0.90. The analysis shows the great potential of our domain-specific historical load features for improved load forecasting. The hourly analysis showed that all customers used 263 kWh per hour on average; our model makes an 8% error on average per hourly forecast (22 kWh), which is 4 percentage points better than the benchmark model. The daily analysis showed that all households used 6140 kWh per day on average; our model makes a 2% error on average per daily forecast (131 kWh) and the benchmark model makes a 6% error (360 kWh). The comparison between end-to-end DL architectures and our proposed feature-based DL architecture showed that our method performs better, achieving significantly lower RMSE than the best-performing end-to-end DL architecture, the temporal convolutional network. We believe that the main reason for this improvement is the domain-specific features, which give the algorithms the most relevant information derived from the raw data for future load forecasting. According to the analysis of similar STLF studies for households, our results are very promising compared to the state-of-the-art approaches. We also believe that our study shows reliable results, because the method was tested on a significantly large number of households over 12 months using a 24 h forecasting horizon.

The proposed method, which predicts EE consumption at the individual household level, offers great commercial potential because it is scalable and does not depend on the current number of households in the system. In addition, predicting individual forecasts enables their aggregation, which yields better forecasts at the aggregate level than the conventional strategy of directly forecasting the aggregated load [25,84]. Implementation of the system does not suffer from a cold start; we addressed the cold-start problem by introducing a general model that does not use household-specific historical load features. This model provides predictions for each new household during its first week in the system, until the data required by the HousEEC model has been collected. Another important detail we considered in the system implementation is the occurrence of missing data, which we tackled with two techniques: interpolation and the substitution of forecasted values into the EE consumption series of the affected household.

We expect that the final model could perform well on other datasets containing EE consumption data for households with similar economic status, located in places with similar weather conditions. It was trained with data from a large number of households, which makes it more general, more robust and able to adapt to many different EE consumption patterns. Additionally, if the HousEEC architecture is used for model re-training with new data, we expect it to show equally good performance, since it incorporates data from multiple relevant sources that affect the EE consumption of households. However, further investigation of the model's performance on other datasets is left for future work.

Another improvement would be to introduce additional features, such as EE price, number of household members, age of users, daily schedule of users (working hours), size of household, and means of heating and cooling. We believe that these attributes would improve the accuracy; however, this requires additional private information about households, which might not be easy to obtain.

Finally, we plan to introduce clustering of households, because each household in the dataset exhibits different trends and patterns, and there are large variations in electricity consumption at the household level. A clustering algorithm would group similar households into clusters and, in a way, define household profiles; there would then be separate prediction models for different clusters of households. We believe this can increase forecasting performance, because there are several types of users (active users who regularly go to work, older users who spend most of the day at home, etc.), and it is difficult for a single model to acquire knowledge about the EE consumption characteristics of all of them.

**Author Contributions:** I.K. and S.S. were the main authors who significantly contributed to the research; in particular, they were responsible for dataset analysis, implementation of methods, experimental evaluation, and writing the paper. I.I. and S.J. were responsible for defining the energy consumption prediction problem, and conceptualization of the approach and the system. M.G. was responsible for the machine learning part of the research, in particular the definitions of algorithms, the concepts, and the ideas regarding how to tackle the problem. H.G. was responsible for the research as a whole, leading and guiding the study, the problem definitions, conceptualization of the methods, definition of the experiments, and writing the paper. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** We are truly grateful to Pecan Street Inc. for giving us the opportunity to use their dataset for our research purposes. We are also thankful for the support of the NVIDIA Corporation and their generous donation of a Titan Xp GPU that we utilized for this study.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**


\* Ongoing collection. \*\* Public access for research purposes by university members.

#### **Appendix B**

**Table A2.** Contextual features.


T, temperature; Tdavg, daily average temperature; H, hour; D, day of week; M, month.

#### **Appendix C**

The Vanilla multiple regression benchmark model is a load forecasting model that uses multiple sources of data to predict future load; in particular, polynomials of temperature and their interaction with calendar features. The model can be defined as follows:

$$\begin{split} L_t = {} & \beta_0 + \beta_1 Trend_t + \beta_2 M_t + \beta_3 W_t + \beta_4 H_t + \beta_5 W_t H_t + \beta_6 T_t + \beta_7 T_t^2 + \beta_8 T_t^3 \\ & + \beta_9 M_t T_t + \beta_{10} M_t T_t^2 + \beta_{11} M_t T_t^3 + \beta_{12} H_t T_t + \beta_{13} H_t T_t^2 + \beta_{14} H_t T_t^3 \end{split} \tag{A1}$$

where *Lt* is the load forecast for time *t*; β*<sup>i</sup>* are the coefficients estimated by the ordinary least squares method; *Trendt* is an increasing counter representing a linear trend at time *t*; *Mt*, *Wt* and *Ht* are the month of the year, day of the week and hour of the day at time *t*, respectively; and *Tt* is the temperature at time *t*.
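For illustration only, the regressor vector of Equation (A1) can be assembled as below; the function name and argument order are assumptions, and the coefficients β would then be fitted by ordinary least squares over the training rows:

```python
def vanilla_features(trend, month, weekday, hour, temp):
    """Feature row for the Vanilla benchmark, Eq. (A1): intercept, trend,
    calendar terms, temperature polynomials, and calendar-temperature
    interactions, in the order beta_0 .. beta_14."""
    return [
        1, trend, month, weekday, hour, weekday * hour,    # calendar terms
        temp, temp ** 2, temp ** 3,                         # temperature polynomial
        month * temp, month * temp ** 2, month * temp ** 3, # month x temperature
        hour * temp, hour * temp ** 2, hour * temp ** 3,    # hour x temperature
    ]
```

Each hourly observation yields one such row; stacking the rows gives the design matrix for the least-squares fit.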

The final benchmarking Vanilla model used in this work is defined as follows:

$$\begin{split} L_t = {} & \beta_0 + \beta_1 M_t + \beta_2 W_t + \beta_3 H_t + \beta_4 W_t H_t + \beta_5 T_t + \beta_6 T_t^2 + \beta_7 T_t^3 \\ & + \beta_8 M_t T_t + \beta_9 M_t T_t^2 + \beta_{10} M_t T_t^3 + \beta_{11} H_t T_t + \beta_{12} H_t T_t^2 + \beta_{13} H_t T_t^3 \\ & + \beta_{14} L_{t-26} + \beta_{15} L_{t-25} + \beta_{16} L_{t-24} + \beta_{17} T_{t-26} + \beta_{18} T_{t-25} + \beta_{19} T_{t-24} \\ & + \beta_{20} T_{davg} H_t + \beta_{21} T_{davg}^2 H_t + \beta_{22} T_{davg}^3 H_t + \beta_{23} T_{davg} M_t + \beta_{24} T_{davg}^2 M_t + \beta_{25} T_{davg}^3 M_t \\ & + \beta_{26} T_{t-26} H_t + \beta_{27} T_{t-26}^2 H_t + \beta_{28} T_{t-26}^3 H_t + \beta_{29} T_{t-25} H_t + \beta_{30} T_{t-25}^2 H_t + \beta_{31} T_{t-25}^3 H_t \\ & + \beta_{32} T_{t-24} H_t + \beta_{33} T_{t-24}^2 H_t + \beta_{34} T_{t-24}^3 H_t + \beta_{35} T_{t-26} M_t + \beta_{36} T_{t-26}^2 M_t + \beta_{37} T_{t-26}^3 M_t \\ & + \beta_{38} T_{t-25} M_t + \beta_{39} T_{t-25}^2 M_t + \beta_{40} T_{t-25}^3 M_t + \beta_{41} T_{t-24} M_t + \beta_{42} T_{t-24}^2 M_t + \beta_{43} T_{t-24}^3 M_t \end{split} \tag{A2}$$

where *Tdavg* is the average daily temperature from two days before the forecasted day. This formula represents the benchmark for obtaining forecasts for instances in the first interval, from midnight to 09:00. Analogously, for instances in the second interval (from 10:00 to midnight), instead of using values referring to 24, 25 and 26 h before the forecasted hour, we used values referring to 48, 49 and 50 h before it. Additionally, we removed the trend variable, since our formulation of the forecasting problem does not meet the requirements for its calculation.

Autoregressive Integrated Moving Average (ARIMA) is one of the most commonly used methods for time-series forecasting. In general, an ARIMA model is denoted ARIMA(p,d,q), where the p parameter is an integer specifying how many lagged values of the series are used to forecast future periods; the d parameter is the number of differencing orders applied to make the series stationary; and q is the number of lagged forecast-error terms in the prediction equation. Seasonal Autoregressive Integrated Moving Average (SARIMA) is seasonal ARIMA, used for time series with seasonality; it is generally denoted SARIMA(p,d,q) <sup>×</sup> (P,D,Q)<sub>S</sub>.
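As a toy illustration of the roles of d and p in ARIMA(p,d,q) (not the SARIMA implementation used in this work, which would come from a statistical package), differencing and a one-step autoregressive forecast can be sketched in plain Python:

```python
def difference(series, d=1):
    """Apply d-th order differencing (the 'd' in ARIMA(p, d, q)),
    the standard transform for making a series stationary."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def ar_forecast(history, phi):
    """One-step AR(p) forecast (the 'p' part): a weighted sum of the last
    p observations, with phi[0] the coefficient of lag 1, phi[1] of lag 2, ..."""
    return sum(c * history[-(i + 1)] for i, c in enumerate(phi))
```

For example, first-order differencing turns the quadratic-like series 1, 3, 6, 10 into the linear 2, 3, 4, and a second pass yields the constant 1, 1.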

#### **Appendix D**


A regression tree is built by evaluating different ways to split a data set based on different conditions. After training the model, the result is a tree whose decision nodes have two or more branches representing values of the chosen feature, and whose leaf nodes represent a numerical prediction of EE consumption.


#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Short-Term Load Forecasting for a Single Household Based on Convolution Neural Networks Using Data Augmentation**

#### **Shree Krishna Acharya <sup>1</sup>, Young-Min Wi <sup>2</sup> and Jaehee Lee <sup>3,</sup>\***


Received: 16 August 2019; Accepted: 15 September 2019; Published: 17 September 2019

**Abstract:** Advanced metering infrastructure (AMI) is spreading to households in some countries and could be a source for forecasting residential electric demand. However, load forecasting of a single household is still a fairly challenging topic because of the high volatility and uncertainty of households' electric demand. Moreover, there is a limitation in the use of historical load data because of changes in house ownership, changes in lifestyle, integration of new electric devices, and so on. The paper proposes a novel method to forecast the electricity loads of single residential households. The proposed forecasting method is based on convolution neural networks (CNNs) combined with a data-augmentation technique, which can artificially enlarge the training data. This method can address issues caused by a lack of historical data and improve the accuracy of residential load forecasting. Simulation results illustrate the validity and efficacy of the proposed method.

**Keywords:** data augmentation; convolution neural network; residential load forecasting

#### **1. Introduction**

Short-term load forecasting (STLF) is an important part of power system planning and operation [1]. The STLF, which has a prediction range of one hour to 168 h, is used for controlling and scheduling of daily power system operations. Furthermore, forecasting customer-level energy consumption is essential for many potential applications in the future power system, such as demand response (DR) programs, home load scheduling with renewables, and optimal operation of energy storage systems (ESS) [2].

Statistical methods, including multiple linear regression [3,4], exponential smoothing [5], and the autoregressive integrated moving average (ARIMA) [6], are the most commonly used for STLF. Recently, deep-learning-based forecasting techniques have gained attention in STLF. Recurrent neural networks (RNNs), capable of learning long-term dependence, were applied to load prediction in [7]. However, the vanishing gradient problem of RNNs makes it hard to improve forecasting accuracy. To overcome this problem, the long short-term memory (LSTM) network has been proposed [8,9]; some studies show that LSTM can improve forecasting performance for system-level forecasts when relatively long-term historical load data are available. In addition, many other deep learning algorithms, such as deep feed-forward networks [10,11] and deep belief networks (DBN) [12], have been applied to load forecasting, and hybrid spectral methods, including wavelet analysis [13] and empirical mode decomposition (EMD) [14] combined with neural networks, have been proposed to remove the uncertainty of historical electrical load. However, all of these techniques were implemented for substation- or system-level load forecasting, trained on sufficient historical load data.

In system-level STLF, a lot of historical data are available because small electrical load changes do not significantly affect the overall load pattern trend. At the household level, however, some households may not have enough data to capture long-term dependencies and properly train the learning network. For example, if a homeowner has recently changed, only a small amount of electrical load data is available for forecasting. Even if homeowners have not changed for a long time, the pattern of energy consumption can change with new electrical devices and lifestyle changes [15]. Therefore, there is a limit to using past data to increase the training data size. In [2,7], RNN and LSTM were applied to single-household load prediction, but these studies did not take the data shortage issue into account.

To address the lack of historical data for training, convolutional neural networks (CNNs) can be an effective method for residential load forecasting, because a CNN can capture short-term trends in load data when local data points are strongly related to each other [16]. In addition, data augmentation can help: it compensates for short data-collection periods by enlarging the training set with extra copies of training examples, which also reduces overfitting in deep-learning techniques [17]. The overall training cost of a deep network is reduced when the input data are large and contain more similar information, and with such enlarged training data the accuracy of residential load forecasting can be improved. The authors of [7] tried to enlarge the input data using other households' load series. However, this approach risks introducing data whose characteristics differ from the load series of the target household. Moreover, some target households have similar load profiles in their load series, but others do not [2]. Because of load-profile diversity within and between households, such pooling can hurt more than it helps. At the same time, a single household's data alone may not provide enough information to fit the large capacity of deep-learning models. Therefore, proper data augmentation is required, which can artificially create new training data from the historical data of a single household.

Herein, the paper proposes a load-forecasting method for a single residential household based on convolutional neural networks (CNNs). A CNN is a type of deep neural network whose structure is formed using convolution and pooling layers [18–20]. With the help of various filters, CNNs can learn the inherent information of an electric load series. The proposed method introduces a novel data-augmentation technique that concatenates various residual load series generated from the electrical load of a single household. The original load series is converted into residual load series containing the uncertainty information of the electrical load. Several residual load series are extracted through multiple k-means clustering to collect sufficient training data [21,22]. Among the extracted residual load series, the less uncertain and more homogeneous ones are fed to the CNN as training data. The proposed augmentation technique can provide enough homogeneous training data for more accurate CNN forecasting. The proposed forecasting method was tested on ten single households for a year and compared with the results of pooling-based augmentation [7].

The rest of this paper is organized as follows. Section 2 discusses the implementation of the proposed augmentation strategies and the context of residual load series. Similarly, Section 3 describes the proposed algorithm. Section 4 talks about the implementation procedure and discusses the simulation results. Section 5 concludes the paper.

#### **2. Augmentation Implementation**

#### *2.1. CNN with Augmentation*

Since an electric load series consists of many daily load profiles, the diversity between load profiles is a major concern for forecasting with CNN filter networks. In fact, CNNs learn more effectively from less diverse training load profiles than from highly diverse ones [23]. Because of human actions, weather, and day types, the differences between load profiles are uncertain, non-linear, and complex [2,24]. At the same time, the amount of training data is crucial for fitting all the parameters of a CNN.

To address the uncertainty and the need for large amounts of training data, state-of-the-art papers used other households' load series for data augmentation. These existing approaches assume that load profiles repeat and that this repetition is often similar among households during the same time interval [7,23]. As a consequence, forecasting accuracy suffers when the augmented data contains load series from households that are only weakly correlated with the target. The effects of human actions and day type differ completely across single households' electricity loads and are very difficult to predict. Thus, incorporating other households' load series into the training data can burden the learning ability of the CNN rather than improve it.

An augmentation strategy applied to a single household is valid when each deep-learning-based forecasting framework improves its performance with it [25,26]. To be useful, augmentation techniques need to capture all the potentially uncertain information in the electric load series. Figure 1a,b illustrate the concept of enlarging data with the existing pooling approach and the proposed augmentation approach for deep-learning-based forecasting. To enlarge the data, the proposed method artificially generates several augmented load series, each extracted in a way that supports the CNN learning strategy: each series should carry less uncertain and more homogeneous information. The appropriate series are concatenated with each other to produce a large amount of training data with homogeneous information. The more homogeneous the information, the more useful it is for the CNN to address granular-level load prediction; any dissimilar augmented load series in the concatenation could disrupt the CNN's optimal learning. Extracting load series with homogeneous information is therefore vital, and the design of such series depends largely on ingenuity [27]. The proposed augmentation strategies are described in the following section.


**Figure 1.** Comparison of data augmentation strategies: (**a**) pooling technique and (**b**) proposed augmentation technique.

#### *2.2. Extraction of Residual Load Series*

The data structure of the historical electricity load of a target household can be expressed as the following matrix:

$$\mathbf{P} = \begin{bmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,t} & \cdots & p_{1,T} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,t} & \cdots & p_{2,T} \\ \vdots & & & & & \vdots \\ p_{d,1} & p_{d,2} & \cdots & p_{d,t} & \cdots & p_{d,T} \\ \vdots & & & & & \vdots \\ p_{D,1} & p_{D,2} & \cdots & p_{D,t} & \cdots & p_{D,T} \end{bmatrix} \tag{1}$$

where *pd*,*<sup>t</sup>* represents the historical electric load at time *t* on day *d*; *D* is the number of days of historical data; and the time period *T* is 24 h. This matrix of historical load can be simplified as follows:

$$\mathbf{P} = [\,p_1,\, p_2,\, p_3,\, \ldots,\, p_d,\, \ldots,\, p_D\,], \tag{2}$$

where vector *pd* is the hourly load profile of the target household on day *d*. Figure 2 shows all daily load profiles of a household for one month; it can be seen that the electrical load of a single household fluctuates abruptly. Generally, a smaller electricity load tends to have higher variation, which makes accurate forecasting of residential electricity load difficult. In CNN-based forecasting, one way to achieve high performance is to reduce the variation in the training data [21].

To improve the learning ability of the CNN, new features must be extracted from the residential load series that are less volatile but still retain the inherent characteristics of the residential load. One way to reduce the variation is to remove the regular pattern from the load series [28] so that the CNN uses only its residuals. With this approach, each load profile is decomposed into a centroid and a residual load profile as follows:

$$p_d = c + r_d, \tag{3}$$

where vector *c* is the centroid load profile (average profile) of the historical load, and *rd* is the residual load series, i.e., the difference between the centroid *c* and the actual load profile on day *d*. The centroid load profile can be used as the baseline of a particular group of daily load profiles, and repeating the centroid load profile yields an average load series for a certain duration. Figure 3 shows the average load series and the residual load series of a residential household for one month. The average load series contains no uncertain information, showing only a regular pattern, while the residual load series contains all the uncertain and complex information about the residential load. The residual load series is less volatile than the original load series but still retains the inherent characteristics of the residential load. Trained on this less volatile data, the CNN can forecast the small residential load more accurately.
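The decomposition of Equation (3) can be sketched as follows, assuming each daily profile is a list of 24 hourly values; the function names are illustrative:

```python
def decompose(profiles):
    """Split daily load profiles into the centroid (average) profile c and
    the per-day residuals r_d, per Eq. (3): p_d = c + r_d."""
    T = len(profiles[0])
    c = [sum(p[t] for p in profiles) / len(profiles) for t in range(T)]
    residuals = [[p[t] - c[t] for t in range(T)] for p in profiles]
    return c, residuals
```

By construction, adding the centroid back to any residual profile recovers the original daily profile exactly.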

**Figure 2.** Daily load profile of a single household for one month.

**Figure 3.** Average and its residual load series of a single household.

To ensure the appropriateness of the residual load series for learning networks, auto-correlation (AC) or partial auto-correlation (PAC) coefficients [29] can be used. Higher AC or PAC coefficients, outside the confidence interval, mean that the present residual load series is strongly coupled with the historical residual load series. Figure 4 shows the AC and PAC coefficients of the residual load series of a selected single household at different time lags. As shown in Figure 4, when the lag is a multiple of 24 h, the AC coefficients peak, so that some AC coefficients lie outside the confidence interval. Similarly, the PAC coefficient spikes confirm that the residual load series carries vital information about the residential electricity load. This information is in a hidden state and can be learned by learning networks.
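This check can be reproduced with a hand-rolled AC coefficient on a synthetic residual series (the series below is an assumed daily sinusoid plus noise, not the paper's data):

```python
import numpy as np

def autocorr(x, lag):
    """Sample auto-correlation coefficient of a series at a given lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.sum(x[lag:] * x[:-lag]) / np.sum(x * x))

# Hypothetical residual load series: a daily (24 h) component plus noise.
rng = np.random.default_rng(1)
t = np.arange(24 * 30)
r = 0.3 * np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(t.size)

# As in Figure 4, AC peaks when the lag is a multiple of 24 h; the usual
# 95% confidence band for white noise is +/- 1.96/sqrt(N).
band = 1.96 / np.sqrt(r.size)
print(autocorr(r, 24) > band)   # the daily lag lies far outside the band
```

A coefficient well outside the band at lag 24 indicates the day-to-day coupling the paper exploits.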

**Figure 4.** Auto-correlation of residual load series with different time lagging: (**a**) auto-correlation (AC) coefficients, and (**b**) partial auto-correlation (PAC) coefficients.

With these residential load series, the paper proposes a CNN-based method for residential load forecasting using data augmentation. The augmentation technique and overall forecasting procedure are described in the following section.

#### **3. Proposed Residential Load Forecasting Method**

#### *3.1. Generation of Centroid Load Profiles*

To generate different centroid load profiles from the historical residential load P, the paper uses multiple k-means clustering [21,22]. The multiple k-means clustering that generates the centroid load profiles of a residential load can be expressed as follows:

$$\text{Minimize } \sum\_{i=1}^{k} \sum\_{\mathbf{p}\_d \in \mathcal{S}\_{k,i}} \|\mathbf{p}\_d - \mathbf{c}\_{k,i}\|^2 \text{ , } k = 1 \text{, } 2, \dots, \mathbf{K} \text{,} \tag{4}$$

where *S<sub>k,i</sub>* is the *i*-th partition of the load profile set P generated with clustering number *k*, and *c<sub>k,i</sub>* is the centroid load profile of the corresponding partition *S<sub>k,i</sub>*. The clustering starts with clustering number 1 and ends with clustering number *K*, so that (*K*<sup>2</sup> + *K*)/2 centroid load profiles are generated in total. The paper uses all centroids generated with clustering numbers from one to *K*, and the *l*-th generated centroid load profile is denoted *c<sub>l</sub>*. The centroid load profiles *c<sub>l</sub>* are (*K*<sup>2</sup> + *K*)/2 daily load patterns, each being the mean of the load patterns of similar days. Figure 5 shows the centroid load profiles of a single household for one month using the multiple k-means clustering algorithm.
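A minimal sketch of the multiple k-means step in Equation (4), using a plain Lloyd's-iteration k-means on synthetic daily profiles (the paper's exact clustering settings are not specified here):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns the k centroid profiles."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else C[i]
                      for i in range(k)])
    return C

# Hypothetical month of daily profiles (30 days x 24 h).
rng = np.random.default_rng(2)
P = 1.0 + 0.5 * rng.random((30, 24))

# Eq. (4): run k-means for every k = 1..K and pool all centroids.
K = 4
centroids = np.vstack([kmeans(P, k) for k in range(1, K + 1)])
print(centroids.shape[0])  # (K^2 + K)/2 = 1 + 2 + 3 + 4 = 10 centroids
```

Pooling the centroids across all clustering numbers is what yields the (*K*² + *K*)/2 baseline profiles.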

**Figure 5.** Centroid load profiles generated using multiple k-means clustering algorithm with different clustering number *k*: (**a**) *k* = 1; (**b**) *k* = 2; (**c**) *k* = 3, and (**d**) *k* = 4.

#### *3.2. Augmentation of Homogeneous Residual Load Profile*

Using the multiple k-means clustering and its corresponding centroid load profiles, the residual load profiles can be generated as follows:

$$r\_{d,l} = p\_d - c\_l, \tag{5}$$

where *r<sub>d,l</sub>* is the vector of residual load profiles of the target household on day *d*, generated with the *l*-th centroid load profile. This residual load profile is expected to be less volatile and less uncertain than the original load profile. In addition, Equation (5) greatly increases the amount of training data: one time series (*p<sub>d</sub>*) of a single household load is transformed into several residual load series (*r<sub>d,1</sub>*, *r<sub>d,2</sub>*, ... , *r<sub>d,L</sub>*).
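With NumPy broadcasting, Equation (5) is a one-liner; the profile and centroid matrices below are hypothetical stand-ins:

```python
import numpy as np

# Hypothetical data: D daily profiles and L centroid profiles (from Eq. (4)).
rng = np.random.default_rng(3)
P = 1.0 + 0.5 * rng.random((30, 24))       # D = 30 days
C = 1.0 + 0.5 * rng.random((10, 24))       # L = 10 centroids

# Eq. (5): r_{d,l} = p_d - c_l  ->  one load series becomes L residual series.
R = P[None, :, :] - C[:, None, :]          # shape (L, D, 24)
print(R.shape)   # (10, 30, 24): the training data are enlarged L-fold
```

Each slice `R[l]` is a full month of residuals against centroid *l*, which is exactly the augmentation effect described above.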

For the CNN, the augmented training data are concatenated with each other, and each training series should carry less uncertainty and more homogeneous information. The more homogeneous the information, the more useful it is for the CNN to address granular-level load prediction, whereas dissimilar augmented load series could disrupt the CNN's learning. Therefore, the more homogeneous residual load profiles must be selected from among those generated by Equation (5).

To select the more homogeneous residual load profiles, the Frobenius norm [30] Φ*l* of each set of residual load profiles is calculated as follows:

$$\Phi\_l = \sqrt{\sum\_{d=1}^{D} \|r\_{d,l}\|^2},\tag{6}$$

where each *r<sub>d,l</sub>* is sequentially observed over time. When the residual load profiles *r<sub>d,l</sub>* have a lower Φ*l*, most residuals generated with *c<sub>l</sub>* are close to their centroid, and these augmented residual load profiles can be expected to be homogeneous. Finally, the paper uses only the residual load series with lower Φ*l* as the training set R*in*,

$$\mathcal{R}\_{in} = \left\{ r\_{d,l} \, \middle| \, \Phi\_l \le \frac{1}{L} \sum\_{l=1}^{L} \Phi\_l \right\}. \tag{7}$$
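Equations (6) and (7) reduce to a norm per centroid and a threshold at the mean norm; the data below are the same kind of synthetic stand-in used above:

```python
import numpy as np

# Hypothetical residuals of D = 30 daily profiles against L = 10 centroids.
rng = np.random.default_rng(4)
P = 1.0 + 0.5 * rng.random((30, 24))
C = 1.0 + 0.5 * rng.random((10, 24))
R = P[None] - C[:, None]                  # shape (L, D, 24), as in Eq. (5)

# Eq. (6): Frobenius norm of each set of residual load profiles.
phi = np.sqrt((R ** 2).sum(axis=(1, 2)))  # shape (L,)

# Eq. (7): keep only residual series whose norm is at most the average norm.
R_in = R[phi <= phi.mean()]

# The centroid with the lowest norm is the closest baseline; the paper uses
# its residuals as the test set.
best = int(np.argmin(phi))
print(R_in.shape[0] < R.shape[0])
```

Only the kept slices `R_in` are concatenated into the CNN training set; the rest are discarded as too dissimilar.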

In addition to the training set, the paper used the residual load profiles with the lowest Φ*<sup>l</sup>* as the test set for the CNN model. Figure 6 shows the structure of training and testing sets of the proposed method. In Figure 6, the residual load profile is expressed with its elements as *rd*,*<sup>l</sup>* <sup>=</sup> \* *r*(1,*d*),*l*,*r*(2,*d*),*l*, ... ,*r*(*t*,*d*),*l*, ... ,*r*(24,*d*),*<sup>l</sup>* + . In the training set, several time series of residual load series (*rd*,1, *rd*,2, ... , *rd*,*l*) can be used instead of one original time series (*pd*) of a single household load. With this enragement of training data, forecasting accuracy for individual residential households can be improved.

#### *3.3. CNN Model for Residential Load Forecasting*

The training process runs for a given number of iterations. To optimize the CNN model, the root mean square error between the actual and predicted residuals is minimized in each iteration as follows:

$$\text{argmin} \quad \sqrt{\frac{1}{L} \frac{1}{D} \sum\_{l=1}^{L} \sum\_{d=1}^{D} \left( r\_{d,l} - \hat{r}\_{d,l} \right)^2},\tag{8}$$

where *r*ˆ*d*,*l* is the predicted residual load profile obtained from the CNN, and *L* represents the number of residual load profiles selected via Equation (6). The well-trained and converged CNN forecasting network is then used in the testing process to predict the future load. Since the CNN network deals with residual load profiles, the forecasting result is generated as a residual load profile. The forecast load profile *p*ˆ*D*+1 is then obtained by adding the centroid load profile and the forecast residual load profile as follows:

$$
\hat{p}\_{D+1} = \hat{c} + \hat{r}\_{D+1,p}, \tag{9}
$$

where *r*ˆ*D*+1,*p* represents the day-ahead forecast residual load profile, and *c*ˆ represents the most appropriate centroid load profile, i.e., the one with the lowest Frobenius norm Φ*l*.
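A minimal sketch of Equation (9), with hypothetical values standing in for the CNN output and the selected centroid:

```python
import numpy as np

# Assumed inputs: the lowest-Frobenius-norm centroid c_hat and a day-ahead
# residual forecast r_hat (in practice, the CNN's output).
c_hat = np.full(24, 1.0)       # hypothetical centroid load profile (kW)
r_hat = np.full(24, 0.1)       # hypothetical forecast residual profile

# Eq. (9): forecast load profile for day D+1.
p_hat = c_hat + r_hat
print(p_hat[0])   # 1.1
```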

Figure 7 shows the entire load forecasting procedure of the proposed methodology. In the first stage, centroid load profiles are generated using multiple k-means clustering, and different types of residual load profiles are extracted with the corresponding centroid load profiles. In the second stage, only homogeneous residual load series are selected using the Frobenius norm for training the CNN. In the final stage, the CNN forecasting framework is used to predict day-ahead residential load profiles. The CNN implementation can be summarized in three parts: (1) initialization of the CNN parameters, (2) training the CNN model with the input matrix R*in*, and (3) predicting the day-ahead load profile using the optimally trained CNN model. With the proposed forecasting method, the CNN is expected to cognize the characteristics of historical load profiles more accurately, so that the forecasting accuracy is noticeably improved.


**Figure 6.** Structure of training and testing data for the proposed method.

**Figure 7.** The overall procedure of the proposed load forecasting method for an individual household.

#### **4. Simulation Results**

#### *4.1. Data Description and Hyper-Parameter Tuning*

The proposed method was tested using hourly metering data gathered from 1181 residential households in Seoul, Korea, for one year (August 2016 to July 2017). With this dataset, the results of the proposed method are compared with the results of pooling-based augmentation [7] as well as with the results of other deep-learning models [2,7,20]. For the day-ahead load forecasting, historical load data for the last 30 days were used for the training process, and historical load data of the previous day were used for testing. To evaluate the accuracy of forecasting results, the paper employed the mean absolute percentage error (MAPE) and root mean square error (RMSE) as follows:

$$\text{MAPE} = \frac{1}{T} \sum\_{t=1}^{T} \frac{\left| \hat{p}\_{t, \, (D+1)} - p\_{t, \, (D+1)} \right|}{p\_{t, \, (D+1)}} \times 100\% \, , \tag{10}$$

$$\text{RMSE} = \sqrt{\frac{1}{T} \sum\_{t=1}^{T} \left( \hat{p}\_{t, \, (D+1)} - p\_{t, \, (D+1)} \right)^2} \,. \tag{11}$$
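Equations (10) and (11) translate directly into NumPy; the four hourly points below are made-up values for a quick check:

```python
import numpy as np

def mape(actual, pred):
    """Eq. (10): mean absolute percentage error over T hourly points."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return 100.0 * np.mean(np.abs(pred - actual) / actual)

def rmse(actual, pred):
    """Eq. (11): root mean square error over T hourly points."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((actual - pred) ** 2)))

# Hypothetical day-ahead check on four hourly points (kW).
actual = np.array([1.0, 2.0, 2.0, 1.0])
pred = np.array([1.1, 1.8, 2.2, 1.0])
print(round(mape(actual, pred), 2))   # 7.5
print(round(rmse(actual, pred), 3))   # 0.15
```

Note that MAPE divides by the actual load, so hours with near-zero consumption can inflate it; RMSE does not have this issue.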

The proposed method was implemented and tested in Python with the Keras library, using the TensorFlow backend [31,32]. To avoid over-fitting in the training process, the parameter settings in Table 1 were tested for the proposed method. The tuning process of the hyper-parameters was based on [23], and the numbers of hidden layers were selected based on [33–38]. The common hyper-parameters, such as the activation function, optimizer, and loss function, are reported in Table 1. The additional CNN-specific parameters were set as follows: filter size 3 × 3, 24 input filters, maximum pooling size 3 × 3, and stride 1. The convolution layers were followed by a fully connected layer with rectified linear unit (ReLU) activation functions. The final fully connected layer predicted the one-hour electric load at a time, following [36]. The dataset (30 days) was split into a validation set (3 days), a testing set (1 day), and a training set (26 days).

**Table 1.** Hyper parameters for selected deep learning models.


To tackle the overfitting in CNNs, the proposed method was preliminarily tested with different numbers of hidden layers in CNN architecture. Table 2 shows the forecasting results of ten households with six different hidden layers. The ten single households in Table 2 were randomly selected, and the results were monthly average values in July (peak season in Korea). In most households, the forecasting results with two or three hidden layers were more accurate. With these results, the paper set the number of hidden layers to be 2.

**Table 2.** Load Forecasting results (mean absolute percentage error (MAPE)) with a number of different hidden layers.


#### *4.2. E*ff*ects of Proposed Augmentation Method*

Tables 3 and 4 show the MAPE and RMSE results of day-ahead forecasting with and without the proposed augmentation technique. The ten single households tested in Tables 3 and 4 were randomly selected, and the results are monthly average values for July (the peak season in Korea).


**Table 3.** Load Forecasting results (MAPE) with and without proposed augmentation.

**Table 4.** Load Forecasting results (root mean square error (RMSE)) with and without proposed augmentation.


By using the proposed augmentation, the forecasting accuracy improved by at least 5 percent, as shown in Table 3. It can be concluded that the proposed forecasting method improves the forecasting accuracy significantly over the forecasting models without augmentation. When comparing only the cases with the proposed augmentation, the CNN provided more accurate forecasting than the LSTM. This is probably because the augmented data contain plenty of similar random information with small variations, which gives the CNN the opportunity to reach optimal convergence. On the other hand, since the LSTM requires more repetitive information during training to improve its performance, the augmented data were comparatively ineffective for the LSTM.

Figure 8a,b show the average MAPE and RMSE of ten households, calculated using the results of daily forecasting in July. On most days, the proposed forecasting method significantly improved the forecasting accuracy compared with the forecasting models without augmentation.

Figure 9 shows the daily average of forecasting results for the four most uncertain households. At most times, the proposed forecasting method significantly improved the forecasting accuracy. This is probably because the homogeneous information from the proposed augmentation gives the CNN framework the opportunity to reach optimal convergence.

**Figure 8.** Average mean absolute percentage error (MAPE) and root mean square error (RMSE) of ten arbitrary selected households in July: (**a**) MAPE result, and (**b**) RMSE result.

**Figure 9.** Daily average of MAPE of higher uncertain households from Table 3: (**a**) household 10; (**b**) household 8; (**c**) household 9 and (**d**) household 1.

#### *4.3. Forecasting Results in Peak Day*

Figure 10a,b show the forecasting results on the peak day for two selected households. One was a less uncertain household, which showed the lowest monthly average MAPE in Table 2 (household 4). The other was a more uncertain household, which showed the highest monthly average MAPE in Table 2 (household 10). The peak loads of the two households in July were 1.062 kW (23:00, July 17) and 2.033 kW (18:00, July 14), respectively. During a peak day, the residential load profiles are expected to be highly uncertain, so that load forecasting is more challenging. To test the forecasting accuracy, the results of the proposed forecasting method were compared with the results of forecasting models using pooling augmentation techniques [7].

**Figure 10.** Day-ahead load forecasting for peak day: (**a**) lower uncertain household (household 4) and (**b**) higher uncertain household (household 10).

For pooling augmentation, the simulation used historical load data of six neighbors as an additional training set. In Figure 10a,b, the predicted load curves of the proposed method were much closer to the actual load profile at most hours of the day, for both households. Especially at the peak time of the day, the proposed forecasting model can provide significantly accurate forecasting results. On the other hand, the forecasting models using the other pooling techniques showed a worse performance for load forecasting at the peak time.

An important day for a forecasting test is the day of maximum energy. During this day, the residential load profiles would be highly uncertain at all hours of the day, so that the load forecasting is more challenging. The daily maximum energy consumption of the selected two households in July was 19.381 kWh (July 21) and 15.797 kWh (July 30), respectively. Figure 11a,b demonstrate the target load profile and predicted load profile of the less uncertain household and the highly uncertain household, respectively. For the less uncertain household, the predicted load curves of the proposed method were much closer to the actual load profile at most hours of the day. The proposed model reported 7.999% of MAPE for the less uncertain household, which was lower than the 10.6566% of MAPE from the pooled LSTM. Similarly, for the highly uncertain household, the proposed model reported 40.4058% of MAPE, which was lower than the 53.319% of MAPE from the pooled LSTM. These results strongly validate the proposed methodology for load prediction in the residential sector.


**Figure 11.** Day-ahead hourly load forecasting for the day of maximum energy consumption: (**a**) less uncertain household (household 4) and (**b**) highly uncertain household (household 10).

#### *4.4. Monthly Results of Day-Ahead Load Forecasting*

To examine the efficacy of the proposed technique over one year, this section tests the performance throughout eleven months using one of the best-performing households and one of the worst-performing households.

Tables 5 and 6 show the monthly average MAPE and RMSE results of the less uncertain household and the highly uncertain household. The test results are from September 2016 to July 2017. In Tables 5 and 6, the proposed method improved the forecasting accuracy by more than 6 percent throughout the year.


**Table 5.** Monthly-average of MAPE for the lower uncertain household.

**Table 6.** Monthly-average of MAPE for the higher uncertain household.


#### *4.5. Impact of Clustering Number K*

Figure 12a,b show the average MAPE and RMSE results of ten households with different clustering numbers *K*. In Figure 12, more accurate forecasting can be expected for most households when the clustering number *K* is increased. However, for some households, the MAPE increases when the clustering number *K* exceeds 4. A higher clustering number *K* provides more training data to the CNNs, so that the CNNs have more chance to learn the load characteristics. However, more training data can also increase the variation of the training set, causing overfitting of the CNN hyper-parameters and degrading the optimal learning of the CNNs. Therefore, the optimal clustering number must be selected for each household.

**Figure 12.** Effect of clustering number *K* on forecasting accuracy: (**a**) MAPE and (**b**) RMSE.

#### **5. Conclusions**

The paper proposed a forecasting method based on convolutional neural networks (CNNs) combined with a data augmentation technique, with consideration of an insufficient period of training data. The proposed data augmentation can enlarge the training data for CNNs using only a target household's own historical data, without the help of other households. The proposed forecasting method transforms one time series of a single household's load data into several time series of residual loads. With this enlargement of the training data, forecasting accuracy for individual residential households can be improved. The test results indicated that the proposed method delivers a notable improvement by including homogeneous information in an individual residential household's load forecasting. The proposed method can be used for energy management at the household level and for evaluating the baseline of household energy consumption for demand-response programs.

**Author Contributions:** For research, S.K.A. developed the idea of augmentation strategies for a CNN-based forecasting framework, performed the simulation, and wrote the paper. Y.-M.W. helped organize the article. J.L. provided guidance for research and revised the paper.

**Funding:** This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2017R1C1B5016244). This research was supported by Korea Electric Power Corporation. (Grant number: R18XA04).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Solving the Cold-Start Problem in Short-Term Load Forecasting Using Tree-Based Methods**

#### **Jihoon Moon 1, Junhong Kim 2, Pilsung Kang <sup>2</sup> and Eenjun Hwang 1,\***


Received: 20 January 2020; Accepted: 12 February 2020; Published: 17 February 2020

**Abstract:** An energy-management system requires accurate prediction of the electric load for optimal energy management. However, if the amount of electric load data is insufficient, it is challenging to perform an accurate prediction. To address this issue, we propose a novel electric load forecasting scheme using the electric load data of diverse buildings. We first divide the electric energy consumption data into training and test sets. Then, we construct multivariate random forest (MRF)-based forecasting models according to each building except the target building in the training set and a random forest (RF)-based forecasting model using the limited electric load data of the target building in the test set. In the test set, we compare the electric load of the target building with that of other buildings to select the MRF model that is the most similar to the target building. Then, we predict the electric load of the target building using its input variables via the selected MRF model. We combine the MRF and RF models by considering the different electric load patterns on weekdays and holidays. Experimental results demonstrate that combining the two models can achieve satisfactory prediction performance even if the electric data of only one day are available for the target building.

**Keywords:** short-term load forecasting; building electric energy consumption forecasting; cold-start problem; transfer learning; multivariate random forests; random forest

#### **1. Introduction**

The continuing environmental problems caused by the enormous amount of carbon dioxide produced by burning fossil fuels, such as coal and oil, for energy production have resulted in considerable focus on smart grid technologies owing to their effective use of energy [1,2]. A smart grid is an intelligent electric power grid that combines information and communication technology with the existing electric power grid [3]. The smart grid can optimize energy use by sharing electric energy production and consumption information between consumers and suppliers in both directions and in real time [4]. The most fundamental approach for the sustainable development of smart grids is electric power generation using renewable energy sources, such as photovoltaic and wind energy [5,6]. Furthermore, an energy management system (EMS) in smart grids requires an optimization algorithm for the advanced operation of an energy storage system (ESS) [7]; it also has to plan various strategies by considering consumer-side decision making [8].

Artificial intelligence (AI) technology-based applications are a highly relevant area for smart grid control and management [6–8]. In particular, short-term load forecasting (STLF) is a core technology of the EMS [9]; moreover, accurate electric load forecasting is required for stable and efficient smart grid operations [10]. From the perspective of a supplier, it is challenging to provide optimal benefits in a cost-effective analysis while storing a large amount of electric energy in the ESS; however, the smart grid can plan effectively by predicting future electric energy consumption and receiving the required energy from internal and external energy sources [11]. It is also possible to optimize the renewable energy generation process [11,12]. From the perspective of a consumer, the EMS can quickly cope with situations such as blackouts and can help to save energy costs because it confirms the electric energy consumption and peak hours during the day [12].

Electric energy consumption patterns are complicated according to the types of buildings [13]; moreover, the electric energy consumption is frequently changed owing to uncertain external factors [14]. Therefore, it is challenging to predict the exact electric energy consumption in buildings [15]. Besides, when forecasting electric energy consumption, the complex correlations associated with an electric load between the current time and the previous time should be appropriately considered [7,11]. To adequately reflect previously uncertain external factors and electric energy consumption, AI techniques can be used to predict future building electric energy consumption based on diverse information, such as historical electric loads, locations, populations, weather factors, and events [16]. Moreover, the importance of multistep-ahead electric load forecasting has increased to quickly determine new uncertainties in power systems [17].

Most AI techniques use large amounts of data to construct STLF models. However, because sufficient electric load data are not collected for buildings connected to smart grids for only a short time or for new/renovated buildings, it is challenging to construct STLF models for such buildings. We define this problem of insufficient data as a cold-start problem. The cold-start problem [18] can occur in computer-based information systems that require automated data modeling; in particular, it arises when the system cannot draw inferences from insufficient information about users or items. In the future, because of the expansion of the smart grid market, it is expected that new data sets will be collected from newly constructed or renovated buildings. Hence, EMSs require a novel building electric energy consumption forecasting model that can be applied to these buildings.

In this paper, we propose a novel STLF model that combines random forest (RF) models while considering two cases (i.e., weekdays and holidays) to solve the cold-start problem. To achieve this, we first collected sufficient electric energy consumption data sets from 15 buildings. The collected data sets were divided into training and test sets, and we developed a transfer learning-based STLF model based on multivariate random forests (MRF) on the training set. We also constructed an RF-based STLF model using only 24 h of the building's electric energy consumption data and then combined the two models by considering the schedule. Consequently, we assumed that only 24 h of electric energy consumption data were available for the target building in the test set and performed multistep-ahead hourly electric load forecasting (24 points) of the target building to prepare for uncertainty.
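As a rough illustration of this combination idea (not the authors' exact pipeline, whose similarity measure and schedule logic are described later in the paper), the sketch below trains a multi-output random forest on a synthetic source building and a small single-output forest on one day of synthetic target data using scikit-learn:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)

# Source building: 100 days of (features -> next-day 24-point load).
# Features and loads are random stand-ins for calendar/weather inputs.
X_src = rng.random((100, 8))
y_src = rng.random((100, 24))
mrf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_src, y_src)

# Target building: only 24 hourly samples are available (the cold-start case).
X_tgt = rng.random((24, 8))
y_tgt = rng.random(24)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tgt, y_tgt)

# Day-ahead prediction: the multi-output MRF emits all 24 points at once;
# the paper combines it with the small RF per weekday/holiday schedule.
day_ahead = mrf.predict(rng.random((1, 8)))
print(day_ahead.shape)   # (1, 24)
```

The MRF here exploits scikit-learn's native multi-output support, which matches the 24-point multistep-ahead setting described above.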

The rest of this paper is organized as follows: in Section 2, we review several STLF models based on AI techniques using sufficient and insufficient data sets, respectively. In Section 3, we describe the input variable configuration for STLF models. In Section 4, we describe the RF-based STLF model construction in detail. Section 5 presents and discusses the experimental results to evaluate the prediction performance of the proposed model. In Section 6, we provide a conclusion and future research directions.

#### **2. Related Works**

In this section, we introduce the research on electric energy consumption forecasting for buildings with and without sufficient data sets. Table 1 summarizes the information about the selected papers, and these studies are described in detail subsequently.

Several studies have predicted electric energy consumption for buildings with sufficient data sets based on traditional machine-learning and deep-learning (DL) methods. Candanedo et al. [19] proposed data-driven prediction models for a low-energy house using the data from home appliances, lighting, weather conditions, time factors, etc. They used multiple linear regression (MLR), support vector regression (SVR), RF, and gradient boosting machine (GBM) to construct 10 minute-interval electric energy consumption forecasting models. They confirmed that time factors were essential variables to build the prediction models; moreover, the GBM model exhibited better prediction performance than other models. Wang et al. [20] presented an hourly electric energy consumption forecasting model for two institutional buildings based on RF. They considered time factors, weather conditions, and the number of occupants as the input variables of the RF model. They compared the prediction performance of the RF model with that of the SVR model and confirmed that the RF model presented better prediction performance than the SVR model. Li et al. [21] proposed an extreme stacked autoencoder (SAE), which combined the SAE with an extreme learning machine (ELM) to improve the prediction results of building energy consumption. The electric energy consumption data were collected from one retail building in Fremont, CA. The authors predicted the building electric energy consumption at 30 min and 60 min intervals and compared the prediction performance of their proposed model with that of backward propagation neural network (BPNN), SVR, generalized radial basis function neural network, and MLR. Their proposed model demonstrated better prediction performance than other models. Almalaq et al. [22] presented a hybrid prediction model based on a genetic algorithm (GA) and long short-term memory (LSTM). GA was employed to optimize the window size and the number of hidden neurons for the LSTM model construction. Their proposed model predicted two public data sets of residential and commercial buildings and compared the prediction performance with autoregressive integrated moving average (ARIMA), decision tree (DT), k-nearest neighbor, artificial neural network (ANN), GA-ANN, and LSTM models. They confirmed that their proposed model exhibited better prediction performance than other models.

Several studies reported the construction of electric energy consumption forecasting models for buildings with insufficient data sets based on pooling, transfer learning, and data generation. Shi et al. [23] proposed a household electric load forecasting model using a pooling-based deep recurrent neural network (PDRNN). The pooling method was used to overcome the limitation of the complexity of household electric loads, such as volatility and uncertainty. Then, LSTM with five layers and 30 hidden units in each layer was employed to build an electric load-forecasting model using the pooling method. They compared the prediction performance of the PDRNN with that of ARIMA, SVR, recurrent neural network (RNN), etc. and confirmed that their proposed method outperformed ARIMA by 19.5%, SVR by 13.1%, and RNN by 6.5% in terms of the root mean square error. Ribeiro et al. [24] proposed a transfer-learning method, called Hephaestus, for cross-building electric energy consumption forecasting based on time-series multi-feature regression with seasonal and trend adjustments. Hephaestus was applied in the pre- and post-processing phases; then, standard machine learning algorithms such as ANN and SVR were used. This method adjusted the electric energy consumption data from various buildings by removing the effects of time through time-series adaptation. It also provided time-independent features through non-temporal domain adaptation. The authors confirmed that Hephaestus can improve electric energy consumption forecasting for a building by 11.2% by using additional electric energy consumption data from other buildings. Hooshmand and Sharma [25] constructed a transfer learning-based electric energy consumption forecasting model in small data set regimes. They collected publicly available electric energy consumption data and classified different types of customers. Then, normalization was utilized for training the trends and seasonality of time series efficiently. 
They built a convolutional neural network (CNN) architecture through a pre-training step that learns from a public data set with the same type of buildings as the target building. Subsequently, they retrained only the last fully connected layer using the data set of the target building to predict its energy consumption data. They demonstrated that their proposed model consistently achieved lower prediction error when compared to seasonal ARIMA, fresh CNN, and pre-trained CNN models. Tian et al. [26] proposed parallel building electric energy consumption forecasting models based on generative adversarial networks (GANs). They initially generated parallel data through a GAN using a small number of the original data sets and then configured a mixed data set, which included the original data and the parallel data. Finally, they utilized the mixed data set to train several machine learning algorithms, such as BPNN, ELM, and SVR. Experimental results showed that the parallel data had distributions similar to the original data, and the prediction models trained with the mixed data set demonstrated better prediction performance than those trained using the original data, information diffusion technology, heuristic mega-trend-diffusion, and bootstrap methods.

The differences between the methods described above and our method are as follows:

- The forecasting models in previous studies [19–22] presented excellent prediction performance using a sufficient data set of target buildings. However, our forecasting model can exhibit satisfactory prediction performance even if the electric load data of the target building are insufficient.

- The forecasting models in previous studies [19–24,26] could not predict the electric loads of various buildings. However, our forecasting model predicts the electric loads of 15 buildings; consequently, it can be considered a generalized forecasting model.

- The previous studies based on DL techniques [21–23,25,26] incur a considerable computational cost to optimize the various hyperparameters of the DL models for the target building. We use the RF with minimal hyperparameter tuning to construct satisfactory forecasting models for several buildings.

- The previous studies based on transfer learning [24,25] required knowledge of the building types to construct the forecasting models. However, we built a transfer learning-based forecasting model even without knowledge of the building types. Besides, even if only the electric load of 24 h for the target building is known, our forecasting model can perform multistep electric load forecasting.

**Table 1.** Summary of several approaches for building electric energy consumption forecasting (MLR: multiple linear regression, SVR: support vector regression, GBM: gradient boosting machine, RF: random forest, SAE: stacked autoencoder, ELM: extreme learning machine, GA: genetic algorithm, LSTM: long short-term memory, PDRNN: pooling-based deep recurrent neural network, ANN: artificial neural network, CNN: convolutional neural network, GAN: generative adversarial network, BPNN: back-propagation neural network).


#### **3. Input Variable Configuration**

Our goal is to predict the electric energy consumption of buildings that were recently built. To address this issue, we constructed two STLF models, one using the sufficient electric load data of other buildings and the other using the limited electric load data of the target building. In this section, we describe the construction of input variables for practical model training. Section 3.1 describes the electric energy consumption data sets of the 15 buildings that we collected. Section 3.2 describes the input variable configuration for the RF-based forecasting model using the small data set of the target building. Section 3.3 presents the input variable configuration for the transfer learning-based forecasting models using the sufficient data sets from different buildings.

#### *3.1. Data Sets of Electric Energy Consumption from Buildings*

We received the hourly electric energy consumption data collected from smart meters at 15 randomly selected sites from the Korea Electric Power Corporation (KEPCO). One smart meter, which is installed per site, usually represents one building or a couple of connected buildings. In this paper, we define a smart meter as a building. The period of data collection was from 1 January 2015 to 31 July 2018. As the collected building data were provided anonymously and de-identified for privacy reasons, we could not determine the types, characteristics, and locations of the buildings. The collected smart meter data had an average missing value rate of 0.6%; we imputed these missing values using linear interpolation. The statistical analysis of the collected smart meter data for each building is listed in Table 2.


**Table 2.** Statistics of electric energy consumption data from each building (Unit: kWh).

Herein, we made one assumption. Even though we have sufficient electric energy consumption data sets for other buildings, we have the electric energy consumption data set of only 24 h for the target building. Thus, as we know the electric load of the target building for only 24 h (24 points), we predicted the electric energy consumption for the subsequent 24 h.

#### *3.2. Case 1: Time Factor-Based Forecasting Modeling*

As mentioned above, we assumed that only 24 h of electric energy consumption data for the target building are known. Hence, we explain the input variable configuration for the prediction model utilizing these 24 h of data. Generally, while constructing an STLF model, various factors, such as time factors, weather information, and historical electric loads, are reflected as input variables [27,28]. However, we cannot utilize weather information as an input variable because we do not know the locations of the buildings. The historical electric load is used to reflect recent trends and patterns [28,29]; moreover, it can be applied when it comprises a sufficient data set. However, as we assumed that the historical electric load comprises only 24 h of data, using it as an input is inappropriate. This implies that we do not know any recent trends or patterns. Among the time factors, namely month, day, hour, day of the week, and holiday, the data set was too small to effectively reflect the characteristics of months, days, days of the week, and holidays. Therefore, we considered only "hour" as an input variable to construct the STLF model.

$$Hour\_x = \sin\left(\left(\frac{360}{24}\right) \times Hour\right) \tag{1}$$

$$Hour\_y = \cos\left(\left(\frac{360}{24}\right) \times Hour\right) \tag{2}$$

However, although 11 pm and 12 am of the subsequent day are adjacent in time, their numerical difference is 23, which makes it challenging to train the STLF model effectively on the raw hour variable. Thus, we use Equations (1) and (2) to enhance the sequence data in the one-dimensional space to continuous data in the two-dimensional space [11,30]. These variables reflect the continuous characteristic more adequately, similar to the face of a clock, as shown in Figure 1.
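As an illustration, the hour encoding of Equations (1) and (2) can be sketched in a few lines of Python (the function name `encode_hour` is ours, not from the paper; angles are in degrees as in the equations):

```python
import math

def encode_hour(hour: int) -> tuple[float, float]:
    """Map an hour (0-23) onto the unit circle, following Equations (1)
    and (2); the angle (360/24) * hour is given in degrees."""
    angle = math.radians((360 / 24) * hour)
    return math.sin(angle), math.cos(angle)

# 23:00 and 00:00 are far apart on the 1-D hour axis (difference of 23)
# but end up adjacent in the 2-D space, like positions on a clock face.
x23, y23 = encode_hour(23)
x0, y0 = encode_hour(0)
distance = math.hypot(x23 - x0, y23 - y0)
```

In the two-dimensional space, the distance between hour 23 and hour 0 is small (the chord of a 15° arc), which is exactly the continuity that the raw hour variable lacks.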

**Figure 1.** Example of the enhancement of one-dimensional space to two-dimensional space.

Tables 3 and 4 list the results of the statistical analysis of the building electric energy consumption for the one-dimensional and two-dimensional spaces. In Table 3, we can observe that the two-dimensional space reflects the building electric energy consumption more effectively than the one-dimensional space. In Table 4, we can see that almost all *p*-values of the F-statistic are < 2.2 × 10<sup>−16</sup>, which is highly significant. This means that at least one of the predictor variables is significantly related to the outcome variable. We used 24-hour information to construct the STLF model and then applied the same time information to predict the building electric energy consumption over the next 24 h.


**Table 3.** R-squared statistics and standard error for each building.


**Table 4.** F-statistics and *p*-value for each building.

#### *3.3. Case 2: Transfer Learning-Based Forecasting Modeling*

In addition to the 24 h electric energy consumption data set of the target building, we also have sufficient electric energy consumption data sets of other buildings. Hence, we can reflect various characteristics as input variables for the transfer learning-based model construction. Consequently, we used time factors and the historical electric load as input variables. We first divided the electric energy consumption data of all buildings into training and test sets. In the training set, we used the electric energy consumption data of different buildings to train the transfer learning-based STLF models, and these models were applied to the electric energy consumption data of the target building in the test set. For time factors, we used the month, week, day, hour, day of the week, and holiday information. In the case of months, weeks, days, and days of the week, we enhanced the time factors from the one-dimensional space to the two-dimensional space, as shown in Equations (3) to (10) [30,31]. Here, *WN* is the number of weeks in the year based on the ISO 8601 standard [32], and *LDM* represents the last day of the month.

$$Day\_x = \sin\left(\left(\frac{360}{LDM}\right) \times Day\right) \tag{3}$$

$$Day\_y = \cos\left(\left(\frac{360}{LDM}\right) \times Day\right) \tag{4}$$

$$Week\_x = \sin\left(\left(\frac{360}{WN}\right) \times Week\right) \tag{5}$$

$$Week\_y = \cos\left(\left(\frac{360}{WN}\right) \times Week\right) \tag{6}$$

$$Month\_x = \sin\left(\left(\frac{360}{12}\right) \times Month\right) \tag{7}$$

$$Month\_y = \cos\left(\left(\frac{360}{12}\right) \times Month\right) \tag{8}$$

$$Day\_of\_the\_Week\_x = \sin\left(\left(\frac{360}{7}\right) \times Day\_of\_the\_Week\right) \tag{9}$$

$$Day\_of\_the\_Week\_y = \cos\left(\left(\frac{360}{7}\right) \times Day\_of\_the\_Week\right) \tag{10}$$

The holidays included Saturdays, Sundays, and national holidays and exhibited predominantly low electric energy consumption during working hours, unlike working days [33]. In the case of holidays, we used one-hot encoding to reflect this property as "1" on a holiday and "0" otherwise. Hence, we applied nine time factors as input variables at each prediction time point.
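The nine time factors can be sketched as follows, under the assumption that they comprise the four sin/cos pairs of Equations (3)–(10) plus the one-hot holiday flag (an illustrative Python sketch; the paper's pipeline was implemented in R, and the function names are ours). *WN* is computed as the number of ISO 8601 weeks in the year and *LDM* as the last day of the month:

```python
import calendar
import math
from datetime import date

def cyclic(value: float, period: float) -> tuple[float, float]:
    """sin/cos pair for a value with the given period (Equations (3)-(10);
    angles in degrees)."""
    angle = math.radians((360 / period) * value)
    return math.sin(angle), math.cos(angle)

def time_factors(d: date, is_national_holiday: bool = False) -> list[float]:
    """Nine time-factor inputs for one prediction date (assumed composition:
    four sin/cos pairs plus the one-hot holiday flag)."""
    ldm = calendar.monthrange(d.year, d.month)[1]   # LDM: last day of the month
    wn = date(d.year, 12, 28).isocalendar()[1]      # WN: ISO 8601 weeks in the year
    # Saturdays (6), Sundays (7), and national holidays are encoded as 1.
    holiday = 1.0 if is_national_holiday or d.isoweekday() >= 6 else 0.0
    return [
        *cyclic(d.day, ldm),              # Day_x, Day_y           (Eqs. 3-4)
        *cyclic(d.isocalendar()[1], wn),  # Week_x, Week_y         (Eqs. 5-6)
        *cyclic(d.month, 12),             # Month_x, Month_y       (Eqs. 7-8)
        *cyclic(d.isoweekday(), 7),       # Day_of_the_Week_x, _y  (Eqs. 9-10)
        holiday,                          # one-hot holiday flag
    ]
```

The exact grouping of the nine factors is our reading of the text; the point of the sketch is that each calendar attribute with a known period maps onto a sin/cos pair, while the holiday indicator remains a single binary input.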

In addition, we applied not only the historical electric load data but also the day-of-the-week and holiday information that accompanies them, to reflect both the historical electric load of the building and its characteristics. Consequently, we configured 15 input variables at each prediction time point. Table 5 lists the information on these variables, where *D* represents the day.


**Table 5.** List of input variables for the transfer learning-based forecasting model.

As our goal is to predict all 24 time points of the following day, we constructed the full input set by replicating the input variables for each prediction time point. Thus, we used a total of 360 input variables, i.e., 15 (number of input variables) × 24 (prediction time points), for the STLF model construction, as shown in Figure 2.

**Figure 2.** Input variable configuration for the transfer learning-based forecasting model.

#### **4. Forecasting Model Construction**

In this section, we describe the STLF model construction using the limited data set of the target building and the data sets of other buildings; moreover, we also present the method to select the prediction value derived from the transfer learning-based model. In addition, we combined the two STLF models and thus present a total of 15 STLF models. Figure 3 illustrates a brief system architecture of our proposed model.

**Figure 3.** Overall system architecture.

#### *4.1. Case 1: Time Factor-Based Forecasting Model*

To construct an STLF model using only "hour" as an input (independent) variable, we used one statistical technique and two machine learning algorithms. Even though SVM, DL, and boosting methods exhibit excellent prediction performance in STLF [7–9,34–37], they require a significant amount of time to optimize their various hyperparameters and also require sufficient data sets. We did not consider these methods because we constructed an STLF model using only 24 h of building electric energy consumption data. Thus, we considered MLR, DT, and RF, which allow simple model construction and exhibit satisfactory prediction performance [38,39]. We used two time factors, namely, *Hourx* and *Houry*, as independent variables and the electric energy consumption of the target building as the dependent (output) variable. Therefore, we predicted multistep-ahead hourly electric loads using the time factor-based forecasting model via a sliding window time series analysis, as shown in Figure 4.

**Figure 4.** Example of multistep-ahead electric load forecasting via a sliding window method.

#### 4.1.1. Multiple Linear Regression (MLR)

MLR is a common statistical technique that is widely used in many STLF models [39]. MLR [39] analyzes the relationship between a continuous dependent variable and one or more independent variables. The value of the dependent variable can be predicted using a range of values of the independent variables based on an identity function that describes, as closely as possible, the relationship between these variables. In addition, MLR determines the overall fit of the prediction model and the relative contribution of each independent variable to the total variance. Equation (11) represents the method to construct the MLR-based STLF model. *Yi* is the forecast energy consumption at time *i*, and β<sub>0</sub> is the intercept. β<sub>1</sub> and β<sub>2</sub> are the population slope coefficients. By defining the weights of the MLR model based on β<sub>1</sub> and β<sub>2</sub> at the prediction time, we can construct a more sophisticated STLF model. Herein, β<sub>0</sub>, β<sub>1</sub>, and β<sub>2</sub> are calculated when the prediction model is built by using the electric loads of the day preceding the prediction points. Because our model focuses on the prediction of multistep electric loads using a sliding window method, the weights β<sub>0</sub>, β<sub>1</sub>, and β<sub>2</sub> are adjusted every hour.

$$Y\_i = \beta\_0 + \beta\_1 \times Hour\_x + \beta\_2 \times Hour\_y \tag{11}$$
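A minimal sketch of this sliding-window MLR, assuming an ordinary least-squares fit of Equation (11) on the previous day's 24 hourly loads (illustrative Python; the paper's models were built in R, and the function names are ours):

```python
import numpy as np

def hour_features(hours):
    """Design matrix [1, Hour_x, Hour_y] for Equation (11); degrees."""
    ang = np.radians((360 / 24) * np.asarray(hours, dtype=float))
    return np.column_stack([np.ones(len(ang)), np.sin(ang), np.cos(ang)])

def mlr_forecast_next_24h(last_24_loads):
    """Fit beta_0, beta_1, beta_2 by least squares on the previous day's
    24 hourly loads, then reuse the same hour features for the next day's
    24 prediction points (in the paper the coefficients are re-estimated
    every hour as the window slides)."""
    X = hour_features(np.arange(24))
    beta, *_ = np.linalg.lstsq(X, np.asarray(last_24_loads, dtype=float), rcond=None)
    return X @ beta
```

Because the hour features repeat with a 24 h period, the fitted coefficients applied to the same design matrix yield the forecast for the corresponding hours of the next day.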

#### 4.1.2. Decision Tree (DT)

A DT [40] is used to construct classification or regression models in the form of a tree structure. It separates a data set into smaller subsets while an associated DT is incrementally extended. The final result is a tree with decision nodes that have two or more branches and leaf nodes that denote a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. DTs can present a higher explanatory power because they determine the independent variables that have a more powerful impact when predicting the values of the target variable [41]. However, continuous variables (i.e., building electric energy consumption) used in the prediction of the time series are treated as discontinuous values. Hence, prediction errors are likely to occur near the boundaries of separation.

#### 4.1.3. Random Forest (RF)

RF [41] is an ensemble learning method that combines different DTs, classifying a new instance by majority vote. Each DT node utilizes a subset of attributes randomly selected from the original set of characteristics. As RF can handle large amounts of data effectively, it exhibits a high prediction performance in the field of STLF [14]. Besides, when compared to other AI techniques, such as DL and gradient-boosting algorithms, RF requires less fine-tuning of its hyperparameters [42]. The basic hyperparameters of RF include the number of trees to grow (nTree) and the number of variables randomly sampled as candidates at each split (mTry). The exact value of nTree is usually not of much concern because increasing the number of trees in the RF raises the computational cost without contributing significantly to prediction performance [14,41]. However, it is interesting to note that picking too small a value of mTry can lead to overfitting. We tune only these two hyperparameters, setting mTry and nTree to 2 and 128 [43], respectively.
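With scikit-learn as a stand-in for the R *'randomForest'* package used in the paper, the time factor-based RF corresponds roughly to the following sketch, where `n_estimators` plays the role of nTree and `max_features` that of mTry (the function name and data handling are our assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_forecast_next_24h(last_24_loads):
    """Time factor-based RF sketch: n_estimators=128 (nTree) and
    max_features=2 (mTry), the settings reported in the paper; the same
    hour features are reused for the next day's 24 prediction points."""
    ang = np.radians(15.0 * np.arange(24))           # 360/24 = 15 degrees per hour
    X = np.column_stack([np.sin(ang), np.cos(ang)])  # Hour_x, Hour_y
    model = RandomForestRegressor(n_estimators=128, max_features=2, random_state=0)
    model.fit(X, np.asarray(last_24_loads, dtype=float))
    return model.predict(X)
```

Note that, with only two input variables, `max_features=2` means every split considers both features; the setting matters more in the 360-variable transfer learning case.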

#### *4.2. Case 2: Transfer Learning-Based Forecasting Model*

Transfer learning [44] is a process that utilizes the knowledge gained while solving one issue to a different but related issue. To achieve this, a base machine-learning method is first trained on a base data set or task. Then, the method repurposes or transfers the learned features to the second target data set or task. This process can operate if the features are general and the intentions are suitable for both base and target tasks rather than being specific to the base task. We previously configured the input variable for transfer learning-based STLF model construction. We first divided the training and test sets. In the training set, we used electric energy consumption data of other buildings to train the transfer learning-based forecasting model. Then, the model predicted the multistep electric load for the target building using its input variables in the test set. Finally, we selected appropriate prediction results for the target building from the other 14 forecasting models. We describe the details in the subsections below.

#### 4.2.1. Multivariate Random Forests (MRF)

Neural network techniques have been used to predict multistep electric loads [45–47]. However, they require scaling (e.g., robust normalization, standardization, and min-max normalization) before building a prediction model [11]. Thus, when the transfer learning-based STLF model is applied to the target building, the electric load of the target building can exhibit a different range from the electric loads of other buildings. Consequently, if STLF models trained on scaled data are used to predict the future electric load of the target building, it is challenging to map their outputs accurately to the electric load range of the target building.

Consequently, we used MRF (applicable when the number of output features is greater than one) for the transfer learning-based STLF model construction. MRF [48] can provide multivariate outcomes and can generate an ensemble of multivariate regression trees through bootstrap resampling and predictor subsampling for univariate RF. In MRF, the node cost is measured as the sum of squares of the Mahalanobis distance, whereas in univariate trees (i.e., RF), the node cost is measured using the Euclidean distance (ED) [49]. MRF can provide excellent prediction performance without scaling for multiple-output predictions [50,51]. Therefore, we used the input variables without scaling to train the MRF models and predicted the electric load of the target building using its input variables, also without scaling. To construct the MRF models, we set mTry and nTree to 120 (number of input variables/3) and 128, respectively [14]. Although a total of 14 prediction results are derived from the MRF models, we do not use all of them; instead, we select only the prediction result of the model trained on the building whose time series is most similar to that of the target building, as determined by a similarity analysis. Here, we consider a total of three similarity techniques and describe the details in the subsections below.
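There is no Mahalanobis-cost MRF in scikit-learn, but its multi-output `RandomForestRegressor` (which splits on a per-output averaged impurity, i.e., a Euclidean-style node cost) gives a rough, clearly hedged approximation of the setup above; the paper itself used the R *'MultivariateRandomForest'* package:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_multioutput_rf(X_train, Y_train, n_inputs=360):
    """Multi-output RF trained on other buildings' data: each sample maps
    the 360 input variables to the next day's 24 hourly loads, with
    nTree=128 and mTry = number of input variables / 3 = 120."""
    model = RandomForestRegressor(
        n_estimators=128,            # nTree
        max_features=n_inputs // 3,  # mTry
        random_state=0,
    )
    model.fit(X_train, Y_train)      # Y_train shape: (n_samples, 24)
    return model

# Illustrative shapes only (random placeholder data, not real loads).
rng = np.random.default_rng(0)
model = fit_multioutput_rf(rng.random((50, 360)), rng.random((50, 24)))
pred = model.predict(rng.random((5, 360)))
```

Because tree-based models predict averages of training targets, no scaling is needed and the outputs stay within the load range seen during training, which is the property the paper exploits.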

#### 4.2.2. Similarity Measures

We used three similarity measures, i.e., Pearson correlation coefficient (PCC), cosine similarity (CS), and ED, to analyze the similarity of the time series between the target building and other buildings. These methods are commonly used for similarity analysis [13].

PCC [52], which is a measure of the linear correlation between two variables, *x* and *y*, is the covariance of these variables divided by the product of their standard deviations for data of equal or proportional scales. It has a value between +1 and −1, according to the Cauchy–Schwarz inequality, where +1 is a perfect positive linear correlation, 0 indicates no linear correlation, and −1 is a perfect negative linear correlation. For the paired data (*x*1, *y*1), ··· , (*xn*, *yn*) consisting of *n* pairs, *rxy*, which is a sample-based estimate obtained by substituting the sample covariances and variances, is defined according to Equation (12). Here, *n* is the sample size, and *xi* and *yi* are the individual sample points indexed with time *i*. *x* and *y* are the sample means of *x* and *y*, respectively. For our experiments, because we considered the hourly electric load for only 24 h, *n* is 24. *xi* and *yi* are the hourly electric loads indexed with *i* of the target and the other buildings, respectively. Herein, we apply the prediction model of the building whose PCC is closest to one when comparing the target building with the other buildings.

$$r\_{xy} = \frac{\sum\_{i=1}^{n} (x\_i - \overline{x}) \times (y\_i - \overline{y})}{\sqrt{\sum\_{i=1}^{n} (x\_i - \overline{x})^2 \times \sum\_{i=1}^{n} (y\_i - \overline{y})^2}} \tag{12}$$

CS [53] indicates the similarity between vectors measured using the cosine of the angle between two vectors in an inner product space. The cosine of an angle of 0◦ is one, and the cosine of any other angle is smaller than one. Therefore, this value is used to determine the similarity of the direction, not the magnitude, of the vectors. The value is +1 when the two vectors point in exactly the same direction, 0 when the angle is 90◦, and −1 when the vectors point in completely opposite directions, i.e., at an angle of 180◦. The magnitude of the vectors does not affect the value. The CS can be applied to any number of dimensions and is often used to measure similarity in a multidimensional positive space. Given the vector values of the attributes *A* and *B*, CS (cos(θ)) can be expressed by the scalar product and magnitudes of the vectors, as shown in Equation (13). For our experiments, as mentioned earlier, *n* is 24. *Ai* and *Bi* are the hourly electric loads indexed with *i* of the target and the other buildings, respectively. Herein, we apply the prediction model of the building whose CS is closest to one when comparing the target building with the other buildings.

$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum\_{i=1}^{n} A\_i \times B\_i}{\sqrt{\sum\_{i=1}^{n} (A\_i)^2} \times \sqrt{\sum\_{i=1}^{n} (B\_i)^2}} \tag{13}$$

ED [54,55] is a common method for calculating the distance between two points. This distance can be used to define the Euclidean space, and the corresponding norm is called the Euclidean norm. A generalized term for the Euclidean norm is the *L*<sup>2</sup> norm or *L*<sup>2</sup> distance [55]. In a Cartesian coordinate system, where *p* = (*p*1, *p*2, ··· , *pn*) and *q* = (*q*1, *q*2, ··· , *qn*) are points in Euclidean *n*-space, the distance between *p* and *q* is the Euclidean norm of their difference, defined according to Equation (14). The distance between any two points on the real line is the absolute value of the numerical difference of their coordinates. It is common to identify the name of a point with its Cartesian coordinate. When the two points *p* and *q* coincide, the distance is zero. For our experiments, as mentioned earlier, *n* is 24. *pi* and *qi* are the hourly electric loads indexed with *i* of the target and the other buildings, respectively. Herein, we apply the prediction model of the building whose ED is closest to zero when comparing the target building with the other buildings.

$$d(p,q) = d(q,p) = \sqrt{(q\_1 - p\_1)^2 + (q\_2 - p\_2)^2 + \dots + (q\_n - p\_n)^2} = \sqrt{\sum\_{i=1}^n (q\_i - p\_i)^2} \tag{14}$$
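The three similarity measures of Equations (12)–(14) translate directly into plain Python (an illustrative sketch; function names are ours). The model of the building with PCC or CS closest to one, or ED closest to zero, over the 24 known hourly loads would then be selected:

```python
import math

def pcc(x, y):
    """Pearson correlation coefficient, Equation (12)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cosine_similarity(a, b):
    """Cosine similarity, Equation (13)."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

def euclidean_distance(p, q):
    """Euclidean distance, Equation (14)."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))
```

Note the difference the paper exploits: PCC and CS are invariant to the scale of the load profile, while ED also accounts for its magnitude.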

#### *4.3. Case 3: Combining Short-Term Load-Forecasting Models*

We considered the following situations to apply more suitable forecasting models for each case. As mentioned above, the electric load patterns vary significantly depending on the day of the week and holidays [32]. For instance, in chronological sequence, the difference in electric loads between Friday and Saturday is large, as is the difference between Sunday and Monday. Therefore, if the time factor-based forecasting model predicts the electric load on the weekend by using the electric load on a weekday, the forecast exhibits high error rates because it follows the weekday pattern. In addition, constructing a time factor-based forecasting model by using only a low electric load, such as that on a Sunday or a national holiday, can cause high error rates because of the high electric load on Monday. To address these issues, we combined the time factor- and transfer learning-based forecasting models. We applied the prediction models by considering two cases at the 24 prediction points, illustrated examples of electric load forecasting using the combined STLF model in Figure 5, and hence constructed a total of 15 STLF models, as shown in Table 6.


**Figure 5.** Examples of multistep-ahead electric load forecasting using the combined short-term load forecasting (STLF) model by considering two cases (panel (d): using only Case 2, holidays).


**Table 6.** Construction of various forecasting models (MLR: multiple linear regression, DT: decision tree, RF: random forest, MRF: multivariate random forests, PCC: Pearson correlation coefficient, CS: cosine similarity, ED: Euclidean distance).

#### **5. Experimental Results**

#### *5.1. Performance Evaluation Metric*

To evaluate the prediction performance of the proposed model, we used the mean absolute percentage error (MAPE), root mean square error (RMSE), and mean absolute error (MAE). The MAPE value usually presents accuracy as a percentage of the error and can be easier to comprehend than the other statistics because this number is a percentage [11,40]. The MAPE can be defined according to Equation (15). The RMSE is used to aggregate the residuals into a single measure of predictive ability, as shown in Equation (16). The RMSE is the square root of the variance, which denotes the standard error. The MAE is used to evaluate how close forecast or prediction values are to the actual observed values, as shown in Equation (17). The MAE is calculated by averaging the absolute differences between the prediction values and the actual observed values. The MAE gives the average magnitude of forecast error, while the RMSE gives more weight to the most significant errors. Lower values of MAPE, RMSE, and MAE indicate better prediction performance of the forecasting model. Here, *n* is the number of observations and *At* and *Ft* are the actual and forecast values, respectively.

$$MAPE = \frac{1}{n} \sum\_{t=1}^{n} \left| \frac{A\_t - F\_t}{A\_t} \right| \times 100\tag{15}$$

$$RMSE = \sqrt{\frac{\sum\_{t=1}^{n} \left(F\_t - A\_t\right)^2}{n}} \tag{16}$$

$$MAE = \frac{1}{n} \sum\_{t=1}^{n} |F\_t - A\_t| \tag{17}$$
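Equations (15)–(17) translate directly into code (a Python sketch; the function names are ours):

```python
def mape(actual, forecast):
    """Mean absolute percentage error, Equation (15)."""
    return 100 / len(actual) * sum(abs((a - f) / a) for a, f in zip(actual, forecast))

def rmse(actual, forecast):
    """Root mean square error, Equation (16)."""
    return (sum((f - a) ** 2 for a, f in zip(actual, forecast)) / len(actual)) ** 0.5

def mae(actual, forecast):
    """Mean absolute error, Equation (17)."""
    return sum(abs(f - a) for a, f in zip(actual, forecast)) / len(actual)
```

For example, with actual loads [100, 200] and forecasts [110, 180], the MAPE is 10%, the MAE is 15, and the RMSE is √250 ≈ 15.81, illustrating how the RMSE penalizes the larger error more heavily than the MAE does.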

#### *5.2. Prediction Performance Evaluation*

To evaluate the performance of the forecasting models, we conducted the experiments on an Intel® Core™ i7-8700K CPU with 32 GB DDR4 RAM and preprocessed the data sets in RStudio version 1.1.453 with R version 3.5.1. We constructed the forecasting models using the *'tree'* (DT), *'randomForest'* (RF), and *'MultivariateRandomForest'* (MRF) packages [49,56,57].

As we collected the electric energy consumption data from a total of 15 buildings, we used 15-fold cross-validation to evaluate the prediction performance. In the collected data, the periods of the training and test sets are from 1 January 2015 to 31 July 2017 and from 1 August 2017 to 31 July 2018, respectively.

We reported training and testing times of the MRF models for each building in Table 7. The training time represents the time to train the MRF model in each building, and the testing time indicates the time for performing all predictions from the three transfer learning-based forecasting models (i.e., M04, M05, and M06).


**Table 7.** Running times of the multivariate random forest (MRF) model in each building (Unit: second).

Because MAPE is a widely used error measurement metric in the electric load-forecasting literature, we presented all the results of multistep-ahead hourly electric load forecasting accuracy using the MAPE in Tables 8–23. We also exhibited the average forecasting accuracy of multistep electric loads using RMSE and MAE results in Tables 24 and 25, respectively.

In Tables 8–25, a cooler color (blue) indicates lower MAPE, RMSE, and MAE values, while a warmer color (red) indicates higher values. To confirm the overall prediction performance of the forecasting models, we presented the average MAPE of the different forecasting models and indicated the best accuracy in bold. In addition, a box plot for each forecasting model is shown in Figures 6–20 using the MAPE values at each prediction point: a box that lies lower and spans a smaller range indicates a more stable forecasting model.

As shown in Tables 8–25 and Figures 6–20, the RF demonstrated the best prediction performance among the time factor-based forecasting models, and MRF\_ED showed the best prediction performance among the transfer learning-based forecasting models. ED demonstrated the best performance because, unlike the PCC or CS, it considers the actual distance between the electric energy consumption of buildings; thus, it can adequately reflect a similar range of electric energy consumption when training the transfer learning-based forecasting model. Consequently, we can observe that M15 demonstrated better prediction performance than the other forecasting models in most experiments. M15 is appropriate for solving the cold-start problem in STLF and uses two tree-based methods; we call this model SPROUT (solving cold-start problem in short-term load forecasting using tree-based methods).


**Table 8.** Mean absolute percentage error (MAPE) comparison of forecasting models for Building 1 (a cooler color indicates a lower MAPE value, while a warmer color indicates a higher MAPE value).

**Figure 6.** Box plot of the mean absolute percentage error (MAPE) of forecasting models for Building 1.


**Table 9.** MAPE comparison of forecasting models for Building 2 (a cooler color indicates a lower MAPE value, while a warmer color indicates a higher MAPE value).

**Figure 7.** Box plot of the MAPE of forecasting models for Building 2.


**Table 10.** MAPE comparison of forecasting models for Building 3 (a cooler color indicates a lower MAPE value, while a warmer color indicates a higher MAPE value).

**Figure 8.** Box plot of the MAPE of forecasting models for Building 3.


**Table 11.** MAPE comparison of forecasting models for Building 4 (a cooler color indicates a lower MAPE value, while a warmer color indicates a higher MAPE value).

**Figure 9.** Box plot of the MAPE of forecasting models for Building 4.


**Table 12.** MAPE comparison of forecasting models for Building 5 (a cooler color indicates a lower MAPE value, while a warmer color indicates a higher MAPE value).

**Figure 10.** Box plot of the MAPE of forecasting models for Building 5.


**Table 13.** MAPE comparison of forecasting models for Building 6 (a cooler color indicates a lower MAPE value, while a warmer color indicates a higher MAPE value).

**Figure 11.** Box plot of the MAPE of forecasting models for Building 6.


**Table 14.** MAPE comparison of forecasting models for Building 7 (a cooler color indicates a lower MAPE value, while a warmer color indicates a higher MAPE value).

**Figure 12.** Box plot of the MAPE of forecasting models for Building 7.


**Table 15.** MAPE comparison of forecasting models for Building 8 (a cooler color indicates a lower MAPE value, while a warmer color indicates a higher MAPE value).

**Figure 13.** Box plot of the MAPE of forecasting models for Building 8.


**Table 16.** MAPE comparison of forecasting models for Building 9 (a cooler color indicates a lower MAPE value, while a warmer color indicates a higher MAPE value).

**Figure 14.** Box plot of the MAPE of forecasting models for Building 9.


**Table 17.** MAPE comparison of forecasting models for Building 10 (a cooler color indicates a lower MAPE value, while a warmer color indicates a higher MAPE value).

**Figure 15.** Box plot of the MAPE of forecasting models for Building 10.


**Table 18.** MAPE comparison of forecasting models for Building 11 (a cooler color indicates a lower MAPE value, while a warmer color indicates a higher MAPE value).

**Figure 16.** Box plot of the MAPE of forecasting models for Building 11.


**Table 19.** MAPE comparison of forecasting models for Building 12 (a cooler color indicates a lower MAPE value, while a warmer color indicates a higher MAPE value).

**Figure 17.** Box plot of the MAPE of forecasting models for Building 12.


**Table 20.** MAPE comparison of forecasting models for Building 13 (a cooler color indicates a lower MAPE value, while a warmer color indicates a higher MAPE value).

**Figure 18.** Box plot of the MAPE of forecasting models for Building 13.


**Table 21.** MAPE comparison of forecasting models for Building 14 (a cooler color indicates a lower MAPE value, while a warmer color indicates a higher MAPE value).

**Figure 19.** Box plot of the MAPE of forecasting models for Building 14.


**Table 22.** MAPE comparison of forecasting models for Building 15 (a cooler color indicates a lower MAPE value, while a warmer color indicates a higher MAPE value).

**Figure 20.** Box plot of the MAPE of forecasting models for Building 15.


**Table 23.** Average MAPE comparison of forecasting models (a cooler color indicates a lower MAPE value, while a warmer color indicates a higher MAPE value).

**Table 24.** Average root mean square error (RMSE) results of 15 forecasting models for each building (a cooler color indicates a lower RMSE value, while a warmer color indicates a higher RMSE value).


To demonstrate the superiority of the SPROUT model, we performed several statistical tests, namely the Wilcoxon signed-rank and Friedman tests [58,59]. The Wilcoxon signed-rank test [58] examines the null hypothesis that there is no significant difference between two models. If the p-value is less than the significance level, the null hypothesis is rejected, and the two models are judged to differ significantly. The Friedman test [59] is a multiple comparison test that aims to identify significant differences among the results of two or more forecasting models. For both tests, we used all the MAPE values (15 buildings × 24 prediction time points) for each forecasting model. The results of the Wilcoxon test, with the significance level set to 0.05, and the Friedman test are listed in Table 26. We can observe that the proposed SPROUT model significantly outperforms the other models because the p-value in all cases is below the significance level.
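Both tests are available in standard statistics libraries. The sketch below runs them with `scipy.stats` on synthetic paired MAPE samples; the array shapes mirror the 15 × 24 setup described above, but the values and effect sizes are invented for illustration only (the real values are those reported in the tables):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-ins for the paired MAPE samples
# (15 buildings x 24 prediction time points = 360 values per model).
n = 15 * 24
mape_sprout = rng.normal(11.2, 2.0, n)            # proposed model (illustrative)
mape_rf = mape_sprout + rng.normal(4.6, 1.0, n)   # weaker competitor (illustrative)
mape_mlr = mape_sprout + rng.normal(5.9, 1.0, n)

# Wilcoxon signed-rank test: paired comparison of two models
w_stat, w_p = stats.wilcoxon(mape_sprout, mape_rf)

# Friedman test: joint comparison of three or more models
f_stat, f_p = stats.friedmanchisquare(mape_sprout, mape_rf, mape_mlr)

reject = w_p < 0.05  # significant difference at the 5% level
```

With paired samples of this size and effect direction, both p-values fall far below 0.05, so the null hypothesis of equal performance is rejected.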


**Table 25.** Average mean absolute error (MAE) results of 15 forecasting models for each building (a cooler color indicates a lower MAE value, while a warmer color indicates a higher MAE value).

**Table 26.** Results of the Wilcoxon and Friedman tests with SPROUT (solving cold start problem in short-term load forecasting using tree-based methods).


#### *5.3. Discussion*

The SPROUT model demonstrated the best performance in the majority of experiments, excluding certain buildings; thus, we analyzed these cases in detail. In Figure 21, we can observe that Buildings 3, 5, 6, 9, and 10 showed no significant difference between weekday and weekend electric loads. As shown in Table 2 and Figure 21f, Building 12 could not reflect similar electric load patterns owing to the wide range of its electric energy consumption data. Hence, the transfer learning-based forecasting models demonstrated unsatisfactory prediction performance. Therefore, despite the differences in weekday and weekend patterns, the time factor-based forecasting models showed better prediction performance.

Building 13 showed that M10 to M12 presented better prediction performance than the other models because MLR could predict the building electric energy consumption better than DT and RF. In addition, as listed in Table 20, Building 13 demonstrated a wide range of electric energy consumption data and also showed the highest electric energy consumption. Therefore, even when the ED was close, it was challenging to accurately derive the results of the transfer learning-based forecasting model. The electric load patterns of Building 15 were considerably similar to those of Building 8, as shown by the *F* test in Table 27. Therefore, as M06 demonstrated accurate predictions on both weekdays and weekends, it exhibited the lowest prediction error rate.


**Table 27.** *F* test between Building 8 and Building 15.


**Figure 21.** Box plot of electric loads according to each day of the week.

#### **6. Conclusions**

When sufficient building electric energy consumption data are not available, as in newly constructed or renovated buildings, it is difficult to train and construct superior STLF models. Nevertheless, electric load-forecasting models are required for efficient power management, even with such limited data sets. In this paper, we proposed a novel STLF model, called SPROUT, to predict electric energy consumption for buildings with limited data sets by combining time factor- and transfer learning-based forecasting models. We used MRF to construct transfer learning-based STLF models for each building with sufficient electric energy data and selected the model that exhibited the most similar time-series pattern to predict the electric load of the target building. We also constructed STLF models based on RF by using only 24 h of building electric energy consumption data and then used the two models depending on weekdays and holidays. To verify the validity and applicability of our model, we used MLR and DT to construct time factor-based forecasting models and compared their prediction performance. The experimental results showed that the RF-based STLF model presented better prediction performance among the time factor-based forecasting models, and the MRF-based STLF model, by applying ED, exhibited better prediction performance among the transfer learning-based forecasting models. By combining these models, our model (SPROUT) presented excellent prediction performance in terms of MAPE, RMSE, and MAE. SPROUT showed an average MAPE value of 11.2 in the experiments and exhibited more accurate prediction performance by 5.9%p (MLR), 6.7%p (DT), and 4.6%p (RF) than the time factor-based STLF models. It also showed more accurate prediction performance by 15.6%p (MRF\_PCC), 16.9%p (MRF\_CS), and 2.6%p (MRF\_ED) than the transfer learning-based STLF models. We demonstrated that SPROUT can achieve better prediction performance than other forecasting models.

However, when a building's weekday and weekend electric loads exhibited no significant difference, the time factor-based STLF models outperformed our model. In addition, the transfer learning-based STLF models presented unsatisfactory prediction performance for the building with the highest electric energy consumption; hence, our model could not perform accurate electric load forecasting there. To address these issues, we plan to consider additional electric load data over a period of time for performing weekly electric load pattern analysis and data normalization.

**Author Contributions:** All authors have read and agree to the published version of the manuscript. Conceptualization, J.M.; methodology, J.M. and J.K.; software, J.M.; validation, J.M., J.K., and P.K.; formal analysis, P.K. and E.H.; data curation, J.M. and J.K.; writing—original draft preparation, J.M.; writing—review and editing, E.H.; visualization, J.K.; supervision, E.H.; project administration, P.K. and E.H.; funding acquisition, P.K. and E.H.

**Funding:** This research was supported in part by the Korea Electric Power Corporation (grant number: R18XA05) and in part by Energy Cloud R&D Program (grant number: 2019M3F2A1073179) through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT.

**Acknowledgments:** We greatly appreciate the anonymous reviewers for their comments and suggestions.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Load Nowcasting: Predicting Actuals with Limited Data**

#### **Florian Ziel**

House of Energy Markets and Finance, University of Duisburg-Essen, 45141 Essen, Germany; florian.ziel@uni-due.de

Received: 20 February 2020; Accepted: 10 March 2020; Published: 20 March 2020

**Abstract:** We introduce the problem of load nowcasting to the energy forecasting literature. The recent load of the objective area is predicted based on limited available metering data within this area. Thus, slightly different from load forecasting, we are predicting the recent past using limited available metering data from the supply side of the system. Next to an industry benchmark model, we introduce multiple high-dimensional models for providing more accurate predictions. They evaluate metered interconnector and generation unit data of different types like wind and solar power, storages, and nuclear and fossil power plants. Additionally, we augment the models by seasonal and autoregressive components to improve the nowcasting performance. We consider multiple estimation techniques based on the lasso and ridge and study the impact of the choice of the training/calibration period. The methodology is applied to a European TSO dataset from 2014 to 2019. The overall results show that, in comparison to the industry benchmark, an accuracy improvement in terms of MAE and RMSE of about 60% is achieved. The best model is based on the ridge estimator and uses a specific non-standard shrinkage target. Due to the linear model structure, we can easily interpret the model output.

**Keywords:** load forecasting; electricity consumption; lasso; Tikhonov regularization; load metering; preliminary load

#### **1. Introduction and Motivation**

In electricity system management, there is a wide range of load forecasting literature [1]. On a high hierarchy level, usually the transmission system operator (TSO) and sometimes the distribution system operator (DSO) are responsible for the metering and publishing of the load in the corresponding electricity system. When it comes to the details, there exists a wide range of definitions for electrical load; see, e.g., [2,3]. In many countries, there exist accounting rules for the system operator, which define the metering process for billing and management purposes. Thus, from the economic point of view, these load values are very important for the generation and consumption side as well as the system operator. However, in many countries, these values are finally published with a large delay with respect to delivery. For instance, PJM publishes the final metered load values with a delay of up to 90 days. Similarly, in Germany, the TSOs publish those final metered values in accordance with the accounting rules with a similar delay of up to three months.

In practice, the system operators also publish electrical load real-time data just after delivery with a very small time lag, usually less than an hour. Those load values are often referred to as preliminary/actual/instantaneous/estimated load, depending on the considered market. Of course, these preliminary load values should be as close as possible to the final metered load values that are computed with respect to the accounting rules for the electricity system. Still, there are usually deviations, which can be substantial in magnitude. For the computation of the preliminary load, the system operator usually only has limited metering data available to deduce the load values for the overall electricity system.

In this paper, we address the problem of providing more accurate preliminary load values just after delivery when only limited metering information is available in the system. Those preliminary load values should be as close as possible to the metered load, which is derived with respect to the accounting and metering rules. The available academic literature on this topic is very limited; see [4].

We contribute to this topic and propose an efficient and robust method for nowcasting load using machine learning and data science techniques. In the data science and forecasting literature, especially in applications to economics and meteorology, the phrase nowcasting is used for producing extremely short-term forecasts or predicting the very recent past [5–9]. As mentioned, in our electricity load situation, we are exactly in the case of predicting the very recent past load values under limited data availability. Hence, we propose the phrase load nowcasting for these situations.

In this manuscript, we first introduce the nowcasting problem in detail. Then, we propose several nowcasting models that are oriented to the load forecasting literature. Afterwards, we proceed with a nowcasting study to validate the models and discuss the corresponding results, including the interpretation of the best performing model. We close with a summary and some conclusions.

#### **2. The Nowcasting Problem**

#### *2.1. Formal Problem Description*

Based on the accounting rules, the system operator has to compute the final load values of the objective region for which he/she is responsible. We denote by *Yt* the corresponding load value at time point *t*. The detailed computation depends on the regulatory details and the mentioned accounting rules of the considered electricity market. Still, independent of the market, all accounting rules that determine the load *Yt* have in common that they specify the system balance, i.e., the match of supply and demand, the interconnection with neighboring areas, and potential grid losses.

Under the assumption of no grid losses, we could state for each time point *t* that:

$$Y_t = \sum_{i} \text{Consumption\_of\_unit}_{i,t} + \sum_{i} \text{Interconnector\_balance}_{i,t}$$

where Consumption\_of\_unit*i*,*<sup>t</sup>* is the electricity consumption of unit *i* and Interconnector\_balance*i*,*<sup>t</sup>* the imbalance of interconnector *i*. Obviously, both sums are taken across all consumption units and interconnectors. Of course, from the generation point of view, we can also state:

$$Y_t = \sum_{i} \text{Generation\_of\_unit}_{i,t} + \sum_{i} \text{Interconnector\_balance}_{i,t}$$

where Generation\_of\_unit*i*,*<sup>t</sup>* is the generated electricity of generation unit *i* and Interconnector\_balance*i*,*<sup>t</sup>* the imbalance of interconnector *i*, as before. In practice, the latter equation is easier to compute, as there are usually fewer production units (mainly large power plants) than consumers. Therefore, the latter is usually applied for deriving the load. Moreover, the generation units are often divided into subgroups depending on generation type, which could be nuclear, lignite, coal, natural gas, oil, pump storage, hydro, biomass, wind, and solar, among others. In formulas, the generation-based equation above turns into:

$$Y\_t = \sum\_{i} X\_{G,i,t} + \sum\_{i} X\_{I,i,t} \tag{1}$$

where *XG*,*i*,*<sup>t</sup>* is the generation of generation unit *i* and *XI*,*i*,*<sup>t</sup>* the interconnector balance *i*. Again, the sums are taken across all units and interconnectors in the balancing area.

As introduced above, the key problem is that only limited metering information is available at the time of prediction. Therefore, some generation units or interconnectors are not metered (yet), which reduces the number of available time series. Thus, we used:

$$Y\_t = L\_t + \varepsilon\_t \tag{2}$$

$$= \sum_{i=1}^{\mathcal{J}_G} X_{G,i,t} + \sum_{i=1}^{\mathcal{J}_I} X_{I,i,t} + \varepsilon_t \tag{3}$$

with *Lt* as the overall metered load across all $\mathcal{J}_G$ metered generation units and all $\mathcal{J}_I$ interconnector balance time series. The error term *ε<sup>t</sup>* absorbs the missing information of *Yt* that is not covered by *Lt*, including potential grid losses and contaminated data. In practice, this usually leads to the fact that *Lt*, the sum of all available metered generation units plus the interconnector imbalances, is well below the targeted load *Yt*. In the application example below, we show that this is about 80% of the overall load. Remember that in a perfect metering environment where $\mathcal{J}_G$ and $\mathcal{J}_I$ cover all units, it holds that *Lt* = *Yt* or, equivalently, *ε<sup>t</sup>* = 0; see (1).

Now, the prediction task is to nowcast (or forecast) *Yt* by a prediction $\widehat{Y}_t$ given the available information up to time *t*, i.e., *XG*,*i*,*<sup>t</sup>* and *XI*,*i*,*t*. A specific restriction is that recent values *Yt*−*<sup>k</sup>* (e.g., *Yt*, *Yt*−1, *Yt*−2, or *Yt*−3) are not available for predicting *Yt*. As mentioned in the Introduction, the last known values usually have a huge delay, often up to 90 days. Thus, we assume that *Yt*−*<sup>K</sup>* is the last known value, where *K* is a relatively large number. In the situation of hourly data with 90 days of publication delay, this would be *K* = 24 × 90 = 2160.

#### *2.2. Data and Problem Illustration*

We considered a dataset ranging from 31 December 2014 to 30 April 2019 for the region of a European system operator. The data were metered in quarter-hourly resolution, and if not stated otherwise, all load values are given in MW. There were $\mathcal{J}_G = 92$ generation time series and $\mathcal{J}_I = 5$ interconnection balance time series available. The generation time series contained seven wind power series, five solar series, and a diverse collection of power plant productions of different types: nuclear, lignite, coal, natural gas (NG), oil, pump storage, hydro, and biomass. Potential missing data were replaced by the last known values. Moreover, we applied clock change adjustments to the data due to daylight saving time. Hence, for the last Sunday in March, we interpolated the missing clock change hour, and for the last Sunday in October, we averaged the doubled clock change hour.
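These two preprocessing steps (last-value imputation and clock-change handling) can be sketched in a few lines of pandas. The series, gap positions, and clock-change locations below are invented for illustration, since the TSO dataset itself is not public:

```python
import numpy as np
import pandas as pd

# Synthetic quarter-hourly series standing in for one metered time series
idx = pd.date_range("2019-04-24", periods=7 * 96, freq="15min")
series = pd.Series(np.random.default_rng(0).normal(40000, 3000, len(idx)), index=idx)
series.iloc[10:13] = np.nan          # simulate a short metering gap

# 1) Replace potential missing data by the last known values
filled = series.ffill()

# 2) Clock-change handling (sketch): interpolate the skipped spring hour;
#    the doubled autumn hour would analogously be averaged into one hour.
spring = filled.copy()
spring.iloc[200:204] = np.nan        # pretend these quarter-hours were skipped
spring = spring.interpolate(method="linear")
```

The autumn case is symmetric: the two observations of the repeated hour are averaged into a single one before further processing.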

In Figure 1, we illustrate an example of the considered dataset for the last week of April 2019. We observed that the load process *Yt* exhibited the typical daily pattern, with smaller values during the night than during the day, and smaller values on the weekend than on working days. Additionally, we see that the process *Lt* (see Equation (3)) is the sum of all available metered series *XG*,*i*,*<sup>t</sup>* and *XI*,*i*,*t*. Note that the metering data exhibited negative values; this held particularly for the transmission data of the interconnectors and the storages. Thus, only if all metered data were positive did the process *Lt* lie visually above all the individual generation and interconnector data. Such a particular example period can be spotted during the last hours of Sunday in Figure 1.

Further, we observed that during the illustrated period, the generation had a substantial infeed of wind and solar power. Additionally, we see that nuclear power provided base load energy, as did some coal power plants in the last two days of April 2019. The remaining power plants contributed only little to the energy supply during this period. Finally, we want to highlight that *Lt* followed the same pattern as *Yt* but lay consistently below *Yt*. This also motivated the first simple model for predicting *Yt*.

**Figure 1.** Time series plot of the load *Yt* and the process *Lt* with its single components *XG*,*i*,*<sup>t</sup>* and *XI*,*i*,*<sup>t</sup>* classified by generation type in the last week of April 2019.

#### **3. Nowcasting Models**

#### *3.1. Benchmark Model*

The industry benchmark from the system operator solves the problem stated above by a linear regression on *Lt* motivated by Equation (2). Thus,

$$Y_t = \alpha_0 + \alpha_1 L_t + \varepsilon_t \tag{4}$$

To estimate the unknown coefficients *α*<sup>0</sup> and *α*1, the industry benchmark applies ordinary least squares (OLS) to the previous year's data of the same month as the target time *t*. Thus, if we want to predict a *Yt* in January, we take all January values of *Yt* and *Lt* from the previous year to estimate *α*<sup>0</sup> and *α*1. As we had quarter-hourly data, this was 31 × 96 data points. Given the OLS estimates $\widehat{\alpha}_0$ and $\widehat{\alpha}_1$, we computed nowcasts $\widehat{Y}_t = \widehat{\alpha}_0 + \widehat{\alpha}_1 L_t$.

The estimation principle is visualized in Figure 2. Here, *α*<sup>0</sup> and *α*<sup>1</sup> of Model (4) were estimated using the input data from April 2018 for estimating *Yt* in April 2019. Note that we will generalize this estimation method slightly and consider a broader range of training period options in the application.

**Figure 2.** (Left) Scatter plot of the process *Lt* (see (3)) and load *Yt* in April 2018 with the fitted line of Model (4). (Right) Time series plot of *Yt*, *Lt*, and $\widehat{Y}_t = \widehat{\alpha}_0 + \widehat{\alpha}_1 L_t$ for the last week of April 2019 as in Figure 1.
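Since the TSO data are not public, the benchmark fit of Model (4) can be sketched on synthetic data; the series, coefficients, and noise levels below are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for one month of quarter-hourly data (31 x 96 points)
n = 31 * 96
L = 40000 + 5000 * np.sin(2 * np.pi * np.arange(n) / 96) + rng.normal(0, 500, n)
Y = 1.25 * L + 2000 + rng.normal(0, 300, n)   # metered load lies above L_t

# OLS fit of Model (4): Y_t = alpha_0 + alpha_1 * L_t + eps_t
A = np.column_stack([np.ones(n), L])
alpha0, alpha1 = np.linalg.lstsq(A, Y, rcond=None)[0]

# Nowcast for a newly observed value of L_t
L_new = 42000.0
Y_hat = alpha0 + alpha1 * L_new
```

In the real benchmark, the fit would use the same calendar month of the previous year, exactly as described above.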

#### *3.2. Proposed Nowcasting Model*

The proposed model was motivated by Equation (3). First, we imposed a linear model on the individual generation and interconnector components by:

$$Y_t = \beta_0 + \sum_{i=1}^{\mathcal{J}_G} \beta_{G,i} X_{G,i,t} + \sum_{i=1}^{\mathcal{J}_I} \beta_{I,i} X_{I,i,t} + \varepsilon_t \tag{5}$$

$$= \beta_0 + \beta_G' X_{G,t} + \beta_I' X_{I,t} + \varepsilon_t \tag{6}$$

$$= \beta_0 + \beta_F' X_{F,t} + \varepsilon_t \tag{7}$$

with *XF*,*<sup>t</sup>* = (*XG*,*t*, *XI*,*t*). We regarded this as a fundamental linear load model, as the only inputs were *XF*, which contained all fundamental information: the generated power data *XG*,*<sup>t</sup>* and the interconnector imbalance *XI*,*t*. Note that (7) can be regarded as a natural extension of (4) because Model (7) turns into (4) by choosing *β<sup>i</sup>* = *α*<sup>1</sup> for *i* > 0.

However, we extended Model (7) by two further terms: (i) a term that contains seasonal information and (ii) a term that represents autoregressive information. In load forecasting, both terms have shown high relevance; see, e.g., [10–13]. Sometimes, models with many seasonal and autoregressive components perform very well even in short-term forecasting; see, e.g., [14].

Formally, the extended model is given by:

$$Y_t = \beta_0 + \beta_F' X_{F,t} + \beta_S' X_{S,t} + \beta_A' X_{A,t} + \varepsilon_t \tag{8}$$

$$= \beta_0 + \beta' X_t + \varepsilon_t \tag{9}$$

where *XS*,*<sup>t</sup>* is a vector of seasonal regressors and *XA*,*<sup>t</sup>* is a vector of autoregressive components of *Yt*. Of course, (8) turns into (7) by choosing *β<sup>S</sup>* = **0** and *β<sup>A</sup>* = **0**. Note that we also defined *X<sup>t</sup>* = (*XF*,*t*, *XS*,*t*, *XA*,*t*), which did not include the intercept. Hence, *β* = (*βF*, *βS*, *βA*) did not include *β*0.

It is widely known that in electricity demand, load, and consumption modeling, periodic features play an important role. The most important seasonalities are the daily, weekly, and annual cycles. We suggested modeling the three periodic components by periodic cubic B-splines with periodicities *SD*, *SW*, and *SA*, which represent a day, a week, and a (meteorological) year, as in [15]. For our quarter-hourly data, we had *SD* = 96, *SW* = 96 × 7 = 672, and *SA* = 96 × 365.24 = 35,063.04. In contrast to Fourier analysis, periodic B-splines have the advantage that the basis functions are local and allow for flexibility. When applied to positive data with positivity constraints, they also benefit from the fact that they are always positive. We chose equidistant basis functions for each period and specified the number of basis functions *BD*, *BW*, and *BA* for each period. For our application, we chose *BD* = 24, *BW* = 12, and *BA* = 24. Thus, *β<sup>S</sup>* had a length of *BD* + *BW* + *BA*, which was 60 in our application.
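One possible construction of such an equidistant periodic cubic B-spline basis is sketched below for the daily component (*SD* = 96, *BD* = 24). The authors do not publish their implementation, so this is only an illustrative realization of the idea: each basis function is a cardinal cubic B-spline wrapped around the period.

```python
import numpy as np

def cubic_bspline(u):
    """Cardinal cubic B-spline N(u) with support [0, 4)."""
    u = np.asarray(u, dtype=float)
    out = np.zeros_like(u)
    m = (u >= 0) & (u < 1); t = u[m];     out[m] = t**3 / 6
    m = (u >= 1) & (u < 2); t = u[m] - 1; out[m] = (-3*t**3 + 3*t**2 + 3*t + 1) / 6
    m = (u >= 2) & (u < 3); t = u[m] - 2; out[m] = (3*t**3 - 6*t**2 + 4) / 6
    m = (u >= 3) & (u < 4); t = u[m] - 3; out[m] = (1 - t)**3 / 6
    return out

def periodic_bspline_basis(t, period, n_basis):
    """Design matrix of n_basis equidistant cubic B-splines wrapped on [0, period)."""
    phase = (np.asarray(t, dtype=float) % period) / (period / n_basis)
    return np.column_stack(
        [cubic_bspline((phase - j) % n_basis) for j in range(n_basis)]
    )

# Daily component for quarter-hourly data: S_D = 96 with B_D = 24 basis functions
t = np.arange(4 * 96)                     # four days of quarter-hours
X_D = periodic_bspline_basis(t, period=96, n_basis=24)
```

The wrapped uniform B-splines form a partition of unity (each row of the design matrix sums to one), and the basis is exactly periodic, which matches the locality and positivity properties mentioned above.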

Furthermore, we had to specify the autoregressive term in (8). We defined the autoregressive components:

$$X_{A,t} = \left( (Y_{t-k})_{k \in \mathcal{K}_K},\ (Y_{t-k})_{k \in \mathcal{K}_A} \right)$$

with two sets of lags $\mathcal{K}_K$ and $\mathcal{K}_A$. $\mathcal{K}_K$ contained lags around the most recent available *Yt*−*K*, and $\mathcal{K}_A$ contained lags around a calendar year ago. The latter mimicked annual effects.

We specified for the most recent lags:

$$\mathcal{K}_K = \{ K + (0, 1, \dots, 8),\ K + S_D + (-8, -7, \dots, 8),\ K + 2 S_D,\ K + 8 S_D \}$$

which contains the nine most recent known values, the values eight quarter-hours around the day before *Yt*−*K*, and the lags of past days at the same quarter-hour as *t*. Remember that *K* = 90*SD* in our application; thus, *Yt*−*<sup>z</sup>* with *z* = *K* + *SD* = 91*SD* = 13 × 7*SD* = 13*SW* had the same weekday as the target *Yt*. For the lags around a year ago, we specified:

$$\mathcal{K}_A = \{ 50 S_W,\ 52 S_W + (-8, -7, \dots, -1, 1, 2, \dots, 8) S_D,\ 52 S_W + (-8, -7, \dots, 8),\ 54 S_W \}$$

as 52*SW* = 364*SD* corresponds to approximately one calendar year. In total, $\mathcal{K}_K$ and $\mathcal{K}_A$ contributed 54 parameters to the model.
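A small sketch constructing the two lag sets as printed above, in quarter-hour units with *K* = 90*SD* (the exact lag choices in the original may differ slightly from this reading of the formulas):

```python
# Quarter-hourly seasonal lengths and the publication delay of the load
S_D, S_W = 96, 96 * 7
K = 90 * S_D

# Lags around the most recent available observation Y_{t-K}
K_K = sorted(set(
    [K + j for j in range(0, 9)]            # nine most recent known values
    + [K + S_D + j for j in range(-8, 9)]   # around the day before Y_{t-K}
    + [K + 2 * S_D, K + 8 * S_D]            # further past days, same quarter-hour
))

# Lags around roughly one calendar year ago (52 * S_W = 364 days)
K_A = sorted(set(
    [50 * S_W]
    + [52 * S_W + j * S_D for j in range(-8, 9) if j != 0]
    + [52 * S_W + j for j in range(-8, 9)]
    + [54 * S_W]
))
```

All lags are at least *K*, so only published values of *Yt* enter the autoregressive term.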

To summarize, the overall Model (8) had many parameters; in our application scenario, there were 5 + 12 + 80 + 60 + 54 = 211 parameters in total. As this might lead to overfitting issues when applying plain OLS, we proposed the application of efficient regularization techniques to tackle the nowcasting problem adequately.

#### *3.3. Estimation of Proposed Nowcasting Model*

We will see that the estimation procedure (or training method) played an important role in accurate nowcasting. Obviously, a natural estimation candidate for Model (8) is linear regression. However, as we had many parameters and some of them might contain useless information, this might be suboptimal. Regularization can help to address the problem. In the energy forecasting literature, the lasso (least absolute shrinkage and selection operator) seems to be a popular choice for shrinkage and feature selection in linear models; see, e.g., [15–18]. An extension of the lasso is given by the elastic net, which has also been applied [19–25].

For introducing the estimation procedure, we require some further notation. Let $\{1, \dots, T\}$ be the time points of available data for $Y_t$. Thus, our objective was to predict the load $Y_t$ at time point $t = T + K$, which corresponds to the actual time point. Let $\mathcal{T} \subseteq \{1, \dots, T\}$ be the training period of size $n_{\mathcal{T}}$. Define $\mathbb{Y}(m_0, s_0) = ((Y_t - m_0)/s_0)_{t \in \mathcal{T}}$ as the $(m_0, s_0)$-standardized response vector and $\mathbb{X}(\boldsymbol{m}_p, \boldsymbol{s}_p) = (\mathbb{X}_i(m_i, s_i))_{i \in \{1,\dots,p\}} = ((X_{i,t} - m_i)/s_i)_{(i,t) \in \{1,\dots,p\} \times \mathcal{T}}$ as the scaled input matrix with scaling coefficients $\boldsymbol{m}_p = (m_1, \dots, m_p)$ and $\boldsymbol{s}_p = (s_1, \dots, s_p)$ and number of input parameters $p$. Denote by $\boldsymbol{m} = (m_0, m_1, \dots, m_p)$ and $\boldsymbol{s} = (s_0, s_1, \dots, s_p)$ the collections of all scaling coefficients.

Furthermore, denote *c* as a vector of the same size as *β*, which will be the shrinkage target. In the vast majority of applications, this is *c* = **0**. The intuition behind this choice is that a specific regressor has zero impact if it contains useless information, to reduce the garbage in, garbage out problem.

With all the notations above, the elastic net estimator for *β* in Model (8) is given as:

$$\widehat{\beta}_{\lambda,\alpha}(\boldsymbol{m},\boldsymbol{s};\boldsymbol{c}) = \underset{(\beta_0,\beta) \in \mathbb{R}^{p+1}}{\arg\min}\ \frac{1}{n_{\mathcal{T}}} \left\| \mathbb{Y}(m_0,s_0) - \beta_0 \mathbf{1} - \mathbb{X}(\boldsymbol{m}_p,\boldsymbol{s}_p)\beta \right\|_2^2 + \lambda\alpha \left\| \beta - \boldsymbol{c} \right\|_1 + \lambda\frac{1-\alpha}{2} \left\| \beta - \boldsymbol{c} \right\|_2^2 \tag{10}$$

where *λ*, *α* ≥ 0 are tuning parameters, *p* is the number of parameters (length of *β*), and $\|\cdot\|_1$ and $\|\cdot\|_2$ denote the standard $\ell_1$ and $\ell_2$ norms. The tuning parameters *λ* and *α* characterize the regularization properties of the elastic net. For *α* = 1, we obtained the popular lasso (least absolute shrinkage and selection operator), and for *α* = 0, the ridge regression, which is also known as Tikhonov regularization. For *λ* = 0, we obtained the OLS solution, and for very large *λ*, a solution very close to the shrinkage target *c*. In the non-ridge case *α* > 0, we even obtained exactly *c* as the solution if *λ* was sufficiently large. For the case of ridge regression, an explicit solution is available:

$$\widehat{\beta}_{\lambda,0}(\boldsymbol{m},\boldsymbol{s};\boldsymbol{c}) = \left( \widetilde{\mathbb{X}}(\boldsymbol{m}_p,\boldsymbol{s}_p)'\,\widetilde{\mathbb{X}}(\boldsymbol{m}_p,\boldsymbol{s}_p) + \mathrm{Diag}\big(\widetilde{\lambda}\big) \right)^{-1} \left( \widetilde{\mathbb{X}}(\boldsymbol{m}_p,\boldsymbol{s}_p)'\,\mathbb{Y}(m_0,s_0) + \mathrm{Diag}\big(\widetilde{\lambda}\big)\,\widetilde{\boldsymbol{c}} \right)$$

with $\widetilde{\mathbb{X}}(\boldsymbol{m}_p, \boldsymbol{s}_p) = (\mathbf{1}, \mathbb{X}(\boldsymbol{m}_p, \boldsymbol{s}_p))$, $\widetilde{\lambda} = (0, \lambda \mathbf{1}')'$, and $\widetilde{\boldsymbol{c}} = (0, \boldsymbol{c}')'$. In the elastic net or lasso case with *α* > 0, efficient estimation techniques based on coordinate descent or LARS (least angle regression) are available. Both options have the drawback that they can only handle the case *c* = **0**.
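The closed-form ridge solution with a non-zero shrinkage target is only a few lines of linear algebra. The sketch below (synthetic data, scaling omitted for brevity, unpenalized intercept) also reproduces the effect motivating the non-zero target discussed below: a regressor that is all-zero in-sample keeps its target coefficient instead of collapsing to zero.

```python
import numpy as np

def ridge_with_target(X, y, lam, c):
    """Ridge estimate shrunk toward a target c instead of 0:
    argmin ||y - b0 - X b||^2 + lam * ||b - c||^2, intercept b0 unpenalized."""
    n, p = X.shape
    Xt = np.column_stack([np.ones(n), X])               # X~ = (1, X)
    lam_vec = np.concatenate([[0.0], np.full(p, lam)])  # lambda~ = (0, lam * 1)'
    c_tilde = np.concatenate([[0.0], np.asarray(c, float)])
    A = Xt.T @ Xt + np.diag(lam_vec)
    b = Xt.T @ y + lam_vec * c_tilde                    # Diag(lambda~) c~
    return np.linalg.solve(A, b)                        # (b0, beta_1, ..., beta_p)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, 1.0, 0.0]) + rng.normal(0, 0.1, 500)

# With target c = 1, a column that is all-zero in-sample keeps a coefficient
# of 1, mirroring the motivation of the c-ridge for offline generation units.
X[:, 2] = 0.0
beta = ridge_with_target(X, y, lam=5.0, c=np.ones(3))
```

For the all-zero third column, the normal equations reduce to *λβ*3 = *λc*3, so the estimate equals the target exactly, whereas a zero-target ridge would force it to zero.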

However, the scaling coefficients *m* and *s* also impacted the estimation substantially. Usually, the scaling coefficients in (10) are chosen so that $\mathbb{Y}(m_0, s_0)$ remains unchanged, i.e., $m_0 = 0$ and $s_0 = 1$, and each $\mathbb{X}_i(m_i, s_i)$ has mean zero and standard deviation one. The latter can be achieved by choosing $m_i = n_{\mathcal{T}}^{-1} \mathbb{X}_i'\mathbf{1}$ and $s_i = \sqrt{n_{\mathcal{T}}^{-1}(\mathbb{X}_i - m_i\mathbf{1})'(\mathbb{X}_i - m_i\mathbf{1})}$. This scaling procedure is standard in the literature and is, e.g., the default in the glmnet or lars packages in R for elastic net and lasso estimation with *c* = **0**.

Still, it turned out that for our nowcasting problem, this scaling procedure for $\mathbb{X}$ was suboptimal, as it ignores recent observations. It is true that $Y_T$ was the last known observation of the response. However, for the regressors $X_t$, we knew all observations up to $T + K$, the time point when the forecast was created. Thus, we proposed to compute the scaling coefficients $s_i$ and $m_i$ on the larger and more recent information set $\mathcal{T}_K = \mathcal{T} \cup \{T+1, \dots, T+K\}$ for all $\mathbb{X}_i$. Moreover, for reasons explained in the next paragraph, we suggested scaling $\mathbb{Y}(m_0, s_0)$ by the corresponding sample mean and standard deviation, $m_0 = n_{\mathcal{T}}^{-1}\mathbb{Y}'\mathbf{1}$ and $s_0 = \sqrt{n_{\mathcal{T}}^{-1}(\mathbb{Y} - m_0\mathbf{1})'(\mathbb{Y} - m_0\mathbf{1})}$.

Now, we discuss the impact of the shrinkage target *c* in more detail. We mentioned already that the standard choice *c* = **0** was motivated by the fact that by default, a regressor has no influence. Only if a regressor contributes substantially to the explanation of the response *Yt* will the estimated coefficient deviate from zero and show a corresponding impact. If we have no further information about our regressors, this is a reasonable approach. We applied this approach to the ridge and lasso estimators and denote them by **0-ridge** and **0-lasso**.

However, in our situation, we knew something about the fundamental relationship between our response vector *Yt* and the fundamental regressors *XF*,*t* from Equations (1) and (7). This fundamental relationship could help to impose a suitable regularization for our model. We explain this with the following example: Suppose there is a situation where, in the in-sample period or training period {0, . . . , *T*}, a certain power plant or interconnector is offline; thus, all its observations are zero. A reason could be that it is a new unit that only started operating after the last known observation *YT*. Then, the ridge or lasso estimators with *c* = **0** will give an estimated coefficient of zero for the corresponding unit. Hence, the power plant will have no out-of-sample contribution to the overall load even though it is operating now, at *YT*+*K*. Thus, from the fundamental point of view, it makes sense to deviate from the shrinkage target of **0** for all generation units or interconnectors. If we assume that the metered values are reasonable, esp. not contaminated by implausible data, then taking these values into account should improve the forecasting accuracy. This holds at least for the situation just explained. Hence, we proposed the choice:

$$\mathbf{c} = (\mathbf{c}\_{F}', \mathbf{c}\_{A,S}')'$$

with *cF* = **1** and *cA*,*S* = **0**, corresponding to the impact in the perfect fundamental situation from Equation (1). Obviously, the vector *cF* had the same length as *βF*, and *cA*,*S* had the aggregated length of *βA* and *βS*. We applied this choice to the ridge regression only and denoted it by **c-ridge**. The reason why this choice was not applied to the lasso or elastic net estimators with *α* > 0 was the unavailability of efficient estimation algorithms.
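A ridge estimator with a nonzero shrinkage target has a simple closed form: substituting *γ* = *β* − *c* reduces the problem to ordinary ridge on the residual response *Y* − X*c*. The paper estimated the models via `solve.QP` in R; the following numpy sketch only illustrates the idea (unpenalized intercept handling and the weighting details are omitted, and the function name is ours):

```python
import numpy as np

def ridge_with_target(X, Y, lam, c):
    """Ridge regression shrinking towards a target c instead of 0.

    Solves min_beta ||Y - X beta||^2 + lam * ||beta - c||^2 via the
    substitution gamma = beta - c, i.e., ordinary ridge regression on
    the residual response Y - X c.
    """
    p = X.shape[1]
    gamma = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ (Y - X @ c))
    return c + gamma
```

For *λ* → ∞ the estimate tends to *c*, so an offline unit keeps its fundamental coefficient of one instead of being shrunk to zero; for *λ* = 0 the OLS solution is recovered.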

#### **4. Nowcasting Study**

We conducted a rolling window nowcasting study using the considered European dataset; the design was similar to a standard rolling window forecasting study, as illustrated in Figure 3. The initial last known load value *YT* was on 29 January 2018 at 23:45. Based on historic data, we nowcast the *SD* = 96 values *YT*+*K*+1, . . . , *YT*+*K*+*SD*. We considered a publication delay of *K* = 90 × 96 = 8640 (90 days), which resulted in the first nowcast being on 30 April 2018, approximately three months later. Then, we shifted the last known load value by a day (*SD* = 96 time points) to *YT*+*SD* and nowcast *YT*+*K*+*SD*+1, . . . , *YT*+*K*+2*SD*. This procedure was repeated *N* = 366 times, which gave an out-of-sample period of about a year and 96 × 366 = 35,136 observations for evaluation. For the in-sample dataset, we considered six choices for our application:


Option (i) used the maximum amount of data, (365 × 3 + 30 − 90) = 1035 days, which was also used for the illustration in Figure 3. Note that Option (vi) was very close to the industry benchmark approach, which used the data of the month of the previous year for estimating the model parameters.
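The rolling window design just described amounts to simple index bookkeeping, sketched below (a plain-Python illustration; the in-sample size is that of option (i), and all variable names are ours):

```python
# Index bookkeeping for the rolling window nowcasting study
# (a sketch of the design in Figure 3; constants taken from the text).
S_D = 96                 # quarter-hours per day
K = 90 * S_D             # publication delay of 90 days = 8640 observations
N = 366                  # number of rolling windows
T0 = 1035 * S_D          # in-sample size of option (i), in observations

nowcast_targets = []
for n in range(N):
    T = T0 + n * S_D     # last known load index, shifted by a day each step
    # nowcast Y_{T+K+1}, ..., Y_{T+K+S_D}
    nowcast_targets.append(range(T + K + 1, T + K + S_D + 1))

total_evaluated = sum(len(r) for r in nowcast_targets)  # 96 * 366 = 35,136
```

Each window predicts one full day that lies *K* observations (90 days) beyond the last known load value, and consecutive windows are shifted by exactly one day.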

**Figure 3.** Illustration of the nowcasting study design.

We considered all competing models, **benchm**, **0-lasso**(*λ*), **0-ridge**(*λ*), and **c-ridge**(*λ*), in the rolling window forecasting study. As emphasized, the lasso and ridge models depended on the tuning parameter *λ*, which we had to specify. For all models, we considered exponential grids Λ for *λ*. In detail, for the ridge models, we chose $\Lambda = 2^{\mathcal{L}\_r}$ with $\mathcal{L}\_r$ an equidistant grid from −10 to 20 of length 100, and for the lasso models, $\Lambda = 2^{\mathcal{L}\_l}$ with $\mathcal{L}\_l$ an equidistant grid from −30 to 3 of length 100. Of course, we did not know the optimal *λ* in advance. Therefore, we considered for the **0-lasso**, **0-ridge**, and **c-ridge** models a version where *λ* was chosen based on the past performance (cumulated loss) of the corresponding models, initializing with *λ* = 1 for the first prediction. We denoted these models by **0-lasso**∗, **0-ridge**∗, and **c-ridge**∗.
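The past-performance rule for choosing *λ* can be sketched as follows (a minimal numpy illustration; the loss-matrix layout and all names are our assumptions, not the paper's implementation):

```python
import numpy as np

def adaptive_lambda_path(errors_per_lambda, grid):
    """Sequentially choose lambda from past cumulated loss.

    errors_per_lambda[t, j] is the loss that the model run with grid[j]
    incurred at rolling step t.  The first prediction uses the grid
    point closest to lambda = 1, as in the text; afterwards, the grid
    point with the smallest cumulated loss so far is selected.
    """
    T, J = errors_per_lambda.shape
    cum = np.zeros(J)
    chosen = []
    for t in range(T):
        if t == 0:
            j = int(np.argmin(np.abs(grid - 1.0)))
        else:
            j = int(np.argmin(cum))
        chosen.append(grid[j])
        cum += errors_per_lambda[t]
    return np.array(chosen)
```

Note that the selection at step *t* only uses losses from steps before *t*, so the rule is feasible in real time.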

For measuring the nowcasting accuracy or measures for forecasting performance, we considered the out-of-sample MAE (mean absolute error) and the out-of-sample RMSE (root mean square error). To evaluate the forecasting accuracy also for each of the *SD* = 96 quarter-hours separately, we defined:

$$\text{MAE} = \frac{1}{S\_D} \sum\_{s=1}^{S\_D} \text{MAE}\_s \quad \text{with} \quad \text{MAE}\_s = \frac{1}{N} \sum\_{n=1}^{N} |Y\_{n,s} - \hat{Y}\_{n,s}| \tag{11}$$

$$\text{RMSE} = \frac{1}{S\_D} \sum\_{s=1}^{S\_D} \text{RMSE}\_s \quad \text{with} \quad \text{RMSE}\_s = \sqrt{\frac{1}{N} \sum\_{n=1}^{N} (Y\_{n,s} - \hat{Y}\_{n,s})^2} \tag{12}$$
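With *N* rolling windows and *SD* = 96 quarter-hours, Equations (11) and (12) reduce to two array reductions (a numpy sketch; the (N, SD) array layout and the function name are our assumptions):

```python
import numpy as np

def nowcast_scores(Y, Y_hat):
    """MAE and RMSE averaged over the quarter-hours, Equations (11)-(12).

    Y and Y_hat have shape (N, S_D): N rolling windows (days) in the
    rows, S_D quarter-hours in the columns.
    """
    err = Y - Y_hat
    mae_s = np.mean(np.abs(err), axis=0)           # MAE_s per quarter-hour
    rmse_s = np.sqrt(np.mean(err ** 2, axis=0))    # RMSE_s per quarter-hour
    return mae_s.mean(), rmse_s.mean(), mae_s, rmse_s
```

The per-quarter-hour vectors `mae_s` and `rmse_s` are exactly what Figure 5 plots across the day.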

Note that our models were regression based, and the forecasted value should coincide with the expected value. Thus, the RMSE should be preferred for evaluation, as it identifies the true mean correctly. In contrast, the MAE is optimal for median forecasts; however, it is often used as a robust alternative to the RMSE. For more details on the evaluation of point forecasts, we refer to [26].

#### **5. Results**

#### *5.1. Nowcasting Performance*

We first discuss the overall nowcasting performance of the considered models. The out-of-sample MAE and RMSE values are given in Tables 1 and 2. There, we also report improvements in the MAE and RMSE with respect to the benchmark model **benchm** estimated on the shortest training period **1month**. Remember that **ridge**∗ and **lasso**∗ chose the tuning parameter based on past performance, whereas **ridge** and **lasso** represent the models that gave ex-post the best prediction accuracy on the *λ*-grid Λ.

**Table 1.** Out-of-sample MAE in MW with relative improvement in % with respect to the benchmark trained on the shortest training period for all models and training periods. A heat map is used to indicate better (→ green) and worse (→ red) performing models.



**Table 2.** Out-of-sample RMSE in MW with relative improvement in % with respect to the benchmark trained on the shortest training period for all models and training periods. A heat map is used to indicate better (→ green) and worse (→ red) performing models.

First, we observe that all ridge and lasso models showed clear improvements over the benchmark. The largest improvement, around 60% in both measures, was gained by the **c-ridge** (or **c-ridge**∗) model calibrated on the training period **2years**. Second, the **ridge**∗ and **lasso**∗ models showed almost the same performance as **ridge** and **lasso**, which indicated that the ex-post selection of *λ* was not a big problem. Next, the benchmark model **benchm** performed best with the short calibration periods **1month** and **2months**. In contrast, the ridge and lasso approaches performed best with the long training periods **2years** and **3years**. The likely reason is that the estimation of many parameters requires more data to obtain stable parameter estimates. Figure 4 illustrates the solution paths of the ridge and lasso models for the calibration period **2years**.

Here, the $\ell\_1$-norm $\|\beta\|\_1$, a typical measure of model complexity, is plotted against the MAE and RMSE scores. Note that $\|\beta\|\_1$ is the sum of the absolute values of all parameters. The solution paths for different *λ* values of a certain model, e.g., **c-ridge**(*λ*) (red circle), are represented by the color intensity. The darker the color of the symbol within the solution path, the smaller *λ*. Thus, black symbols correspond to the OLS solution.

We observe that all three models **c-ridge**(*λ*), **0-ridge**(*λ*), and **0-lasso**(*λ*) converged to the OLS solution for small *λ*. The OLS solution had an MAE of around 500 MW and an RMSE of slightly above 700 MW, with a $\|\beta\|\_1$-norm of around 5.5. We clearly see that for larger *λ* values, **0-ridge**(*λ*) and **0-lasso**(*λ*) obtained smaller *β* values and tended towards the **0** solution. In contrast, **c-ridge**(*λ*) always maintained a similar range of $\|\beta\|\_1$. The corresponding MAE and RMSE minima had a $\|\beta\|\_1$-norm of around 5.2, a magnitude similar to that of the OLS solution. Thus, the parameter complexity of both solutions was comparable, but the parameters were better selected by the **c-ridge** approach due to the shrinkage towards a reasonable target instead of **0**.
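The shape of these solution paths can be reproduced in miniature (an illustrative numpy sketch using the standard closed-form ridge solution with target *c*; this is not the paper's `solve.QP` implementation, and the names are ours):

```python
import numpy as np

def ridge_l1_path(X, Y, lambdas, c):
    """l1-norm of the ridge solution along a lambda grid.

    For c = 0 the norm shrinks towards 0 as lambda grows; for a
    nonzero target c it tends to ||c||_1 instead, mirroring the
    c-ridge path behaviour described in the text.
    """
    p = X.shape[1]
    norms = []
    for lam in lambdas:
        beta = c + np.linalg.solve(X.T @ X + lam * np.eye(p),
                                   X.T @ (Y - X @ c))
        norms.append(np.abs(beta).sum())
    return np.array(norms)
```

This makes the key observation of Figure 4 concrete: the 0-target path collapses towards zero complexity, while the c-target path stays near the complexity of its fundamental target.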

Next, we look at the intraday structure of the nowcasting errors across the 96 quarter-hours. The forecasting accuracy in terms of MAE*s* and RMSE*s* is visualized in Figure 5. There, we observe that the benchmarks exhibited a relatively clear diurnal pattern: the nowcasting error was largest during the working hours, esp. during the afternoon. For the lasso and ridge models, the daily pattern was substantially reduced. For instance, the MAE*s* of **c-ridge**∗ varied between 383 MW and 484 MW, a variation of around 100 MW. The intraday variation of the benchmark model's MAE*s* was around 300 MW and thus significantly larger. However, as the overall forecasting error was reduced by 60%, the relative variation of the MAE forecasting performance remained at a similar level.

**Figure 4.** $\|\beta\|\_1$ against MAE (left) and RMSE (right) of the selected lasso and ridge models, illustrating the solution paths for different *λ* values. The darker the color, the smaller the shrinkage (black = OLS).

**Figure 5.** Intraday prediction accuracy in MAE*s* and RMSE*s* of selected models.

We saw that the proposed models with an in-sample size of about two years performed best. It was clear that the computational complexity increased with the amount of data used for training and calibration. Still, in all cases, the linear model structure allowed implementation and application on a real-time basis. For instance, the estimation of the **c-ridge**, **0-ridge**, and **0-lasso** models on the full *λ*-grid with a training period of **2years** took 3.0 s, 0.5 s, and 2.3 s, respectively. These times were measured on a standard computer using a simple CPU. The ridge models were estimated using the solve.QP function of the R package quadprog, and the lasso model was trained and calibrated using the glmnet function of the R package glmnet.

#### *5.2. Model Interpretation*

As our models were linear, it was relatively easy to interpret the parameters. The easiest way to understand the impact of each parameter in the model is to evaluate the absolute impact of parameter *i* relative to the overall parameter contribution, $|\beta\_i| / \|\beta\|\_1$. These impacts of the **c-ridge**∗ model with a training period of about two years, as well as of the benchmark model **benchm** with a training period of about a month, are illustrated in the bar chart in Figure 6. As the full model had many parameters, we grouped the impacts $|\beta\_i| / \|\beta\|\_1$ by parameter type to keep the results readable.
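The grouped impacts reduce to normalizing the absolute coefficients and summing them per parameter type (a numpy sketch; the group labels and function name are illustrative):

```python
import numpy as np

def grouped_impacts(beta, groups):
    """Absolute impacts |beta_i| / ||beta||_1, summed per parameter group.

    groups maps a group label (e.g., 'EXT_A', 'LAGS_S') to the indices
    of the parameters belonging to it.
    """
    w = np.abs(beta) / np.abs(beta).sum()
    return {g: float(w[idx].sum()) for g, idx in groups.items()}
```

By construction, the grouped impacts sum to one across all groups, which is what makes the bar chart in Figure 6 directly comparable between models.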

**Figure 6.** Bar chart of the absolute impacts $|\beta\_i| / \|\beta\|\_1$ of model **c-ridge**∗ for **2years** and **benchm** for **1month**, grouped by parameter type.

Obviously, only the **c-ridge**∗ model had contributions from external regressors and autoregressive impacts (EXT\_A, EXT\_W, and EXT\_D represent the annual, weekly, and daily seasonal components; LAGS\_A and LAGS\_S represent the annual and short-term autoregressive lags), as the benchmark model did not take those effects into account. The annual impacts contributed substantially to the **c-ridge**∗ model, and this held for both types of effects: deterministic external regressors (EXT\_A) and autoregressive effects (LAGS\_A). Furthermore, the daily seasonal component (EXT\_D) contributed about 3.5% to the overall solution. For the generation units, we observed that all reduced their absolute impact in the **c-ridge**∗ model with respect to the benchmark model. However, all parameters remained relevant.

The interpretation by the absolute impacts $|\beta\_i| / \|\beta\|\_1$ was suitable for evaluating the impact within the estimated model. However, the regressors *Xi*,*t* lived on completely different scales. To obtain interpretable impacts with respect to the load *Yt*, we had to evaluate the time series *βiXi*,*t*, which represented the contribution of each single component to the final model. Therefore, Figure 7 shows a time series plot of the actual load *Yt*, the benchmark model **benchm** nowcasts, and the **c-ridge**∗ model nowcasts, along with the estimated contributions *βiXi*,*t* for each regressor *i*.
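The per-component series *βiXi*,*t* is a simple elementwise product (a numpy sketch; the (T, p) array layout and the function name are our assumptions):

```python
import numpy as np

def component_contributions(beta, X):
    """Per-regressor time series beta_i * X_{i,t}, as plotted in Figure 7.

    X has shape (T, p); each column of the result is one regressor's
    contribution, and the rows sum to the full model fit X beta.
    """
    return X * beta  # broadcasting beta across the time dimension
```

Because the columns sum exactly to the fitted values, stacking them (as in Figure 7) decomposes every nowcast into its generation, interconnector, seasonal, and autoregressive parts.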

**Figure 7.** Time series plot of the actual load *Yt* (black), with the fitted benchmark model (red) and the **c-ridge**∗ approach (blue) on 6–12 August 2018. Additionally, the estimated impact of the single components *βiXi*,*t* for the **c-ridge**∗ model (bottom) and the benchmark model (top), classified by type with different colors, is illustrated.

We observed that for both models, the interconnector, wind, and solar contributed substantially to the final solution. For the **c-ridge**∗ nowcast, a very important contribution to *Yt* came from the annual autoregressive impacts (LAGS\_A); these contributions were mainly positive, but there were also some negative ones. Some moderate impact could be seen from nuclear power and hydro. The latter contributed more to the negative side than to the positive, which was a bit surprising, as the fundamental model would suggest a positive impact. Furthermore, the benchmark model had no negative contribution from hydro power. All other generation types had only a minor impact for both considered models. Finally, we observed that the intercept contributed around 2000 MW to the final solution of the **c-ridge**∗ model, which was about 10% of the overall load *Yt*. Remember that about 80% of the load *Yt* was metered (by generation units and interconnectors). Thus, of the missing 20% of the load, around half (=10%) seemed to be base load.

#### **6. Summary and Conclusions**

We formally introduced the problem of load nowcasting to the energy forecasting literature. In contrast to load forecasting, the recent load of a certain balancing area was predicted based on the limited metering data available within this area; thus, we were predicting the recent past. We introduced an industry benchmark model and multiple high-dimensional linear models to tackle the nowcasting problem. The model design was oriented on load forecasting problems. In addition to the impacts of metered generation and interconnector units, the models contained seasonal and autoregressive components to improve the prediction performance. We considered multiple estimation techniques based on the lasso and ridge and studied the impact of the choice of the training/calibration period.

The overall results showed that in comparison to the industry benchmark, an accuracy improvement in terms of MAE and RMSE of about 60% was achieved. The best model was based on the ridge estimator and used a specific non-standard shrinkage target. Moreover, we highlighted that the model parameters could be interpreted. The overall results showed that the annual effects (deterministic and autoregressive) contributed significantly to the proposed ridge model.

Future research could investigate more nowcasting models, especially non-linear ones like artificial neural networks or support vector machines. Obviously, the study could be extended to probabilistic nowcasting. The considered nowcasting models could also serve as a basis for the construction of load forecasting models. Here, the generation and interconnector units *Xi*,*t* would have to be considered in a lagged manner (*Xi*,*t*−*k*), potentially for multiple lags. In general, many methodologies can be transferred from energy forecasting, especially from short-term load forecasting.

Finally, the model accuracy might be enriched by the use of more external information. In load forecasting, the (average) temperature of the target area is often seen as highly relevant. Thus, its incorporation into a nowcasting model could be beneficial as well. This information can easily be added by including the temperature (and potential non-linear transformations of it) as a new regressor in the model. We can also add further dummy variables that characterize known structural breaks, e.g., changes in the regulation or a reshaping of the balancing area. Furthermore, it is clear that additional metering information would improve the nowcasting accuracy. With respect to renewable energy information from wind and solar power, a finer geographical resolution might improve the forecasting accuracy, as Figure 7 shows a high importance of a few individual time series for the **c-ridge**∗ model with respect to the benchmark model.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**


© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).


*Energies* Editorial Office E-mail: energies@mdpi.com www.mdpi.com/journal/energies
