### *3.2. Methodology*

This section describes the proposed methodology for forecasting time-series using a deep learning approach. Various deep learning architectures can be used for time-series forecasting, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), or feed-forward neural networks (FFNNs).

In this paper, a deep feed-forward network has been used, implemented in the H2O R package [49]. H2O is an open-source framework that implements various machine learning techniques in a parallel and distributed way, using a single machine or a cluster of machines, which makes it scalable for big data projects.

Among the algorithms included in H2O is the feed-forward neural network, one of the most common network architectures. Its main characteristic is that each neuron acts as a basic processing element whose output is propagated to the adjacent neurons.

In addition, in order to select the configuration of the network hyperparameters, we used a GA, implemented with the GA R package [50].

#### 3.2.1. Parameters of the Neural Network

The network architecture implemented in the H2O package needs to be configured by setting several parameters that affect the behavior of the neural network and influence the final results. The most important parameters are: the number of layers, the number of neurons per hidden layer, the L1 regularization strength (*λ*), *ρ*, *ε*, the activation and distribution functions, and the end metric. These are the parameters that the GA will optimize.

The parameter *λ* controls the regularization of the model by adding a penalty term during model creation, so that the predictions fit the actual values as closely as possible without overfitting. The penalty is defined by the following equation:

$$\lambda \sum\_{i=1}^{n} \left| w\_i \right|. \tag{1}$$

In Equation (1), *n* is the number of weights received by the neurons and *wi* represents the *i*-th weight.

The parameter *ρ* manages the update of the synaptic weights and is used to maintain consistency between successive weight updates, controlling how much memory of the previous updates is kept.

The parameter *ε* prevents the deep learning algorithm from getting stuck in local optima or skipping over a global optimum, and can assume values between 0 and 1.

The activation function can assume three values: tanh (hyperbolic tangent), the ramp (rectifier) function, and maxout.

Seven different possibilities are considered for the distribution function: Gaussian, Poisson, Laplace, Tweedie, Huber, Gamma and Quantile.

The end metric defines the measure used to stop the training phase of the deep learning algorithm early. There are seven possibilities: mean squared error (MSE), deviance (the difference between an expected value and an observed value), root mean squared error (RMSE), mean absolute error (MAE), root mean squared log error (RMSLE), mean per-class error, and lift top group; the last metric measures the relative performance of the model.

The possible values for each parameter are shown in Table 2.


**Table 2.** Search space of the neural network parameters.

As we described before, the activation function, distribution function and end metric are categorical parameters, so each value corresponds to a specific category of the parameter.
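
As a concrete illustration, the following minimal sketch shows how these parameters map onto the `h2o.deeplearning()` function of the H2O R package; the frame and column names (`train_hex`, `demand`, `predictors`) are hypothetical, and the parameter values are illustrative rather than GA-selected.

```r
library(h2o)
h2o.init()

# Hypothetical training frame and columns; the values below are illustrative.
model <- h2o.deeplearning(
  x               = predictors,     # input columns (lagged demand values)
  y               = "demand",       # target column
  training_frame  = train_hex,
  hidden          = rep(500, 30),   # 30 hidden layers of 500 neurons each
  l1              = 1e-4,           # lambda: L1 regularization of Equation (1)
  rho             = 0.99,           # consistency between successive weight updates
  epsilon         = 1e-8,           # helps avoid getting stuck in local optima
  activation      = "Maxout",       # or "Tanh", "Rectifier" (ramp)
  distribution    = "gaussian",     # or "poisson", "laplace", "tweedie", ...
  stopping_metric = "MSE",          # the end metric used to stop training early
  stopping_rounds = 5               # stop if no improvement over 5 scoring rounds
)
```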

#### 3.2.2. Genetic Algorithm Parameters


As previously stated, in order to find a near-optimal set of the hyperparameters described in the previous section for the deep learning algorithm, we use a GA; in particular, the implementation provided by the GA R package [50]. Our proposal therefore lies within the field of neuroevolution.

The GA package contains a collection of general-purpose functions for optimization using genetic algorithms. The package includes a flexible set of tools for implementing genetic algorithms in both the continuous and the discrete case, whether constrained or not. However, the package does not allow continuous and discrete parameters to be optimized simultaneously, so we had to treat all the parameters as continuous, which caused the dimension of the search space to increase drastically.

The package allows us to define the objective function to be optimized, which, in our case, is the forecasting error obtained by a deep neural network built with a specific set of parameters. Each individual of the population encodes the values of the eight parameters shown in Table 2.

Each parameter setting yields a specific deep neural network, which is then applied to the data; the resulting forecasting error represents the fitness of the individual.

In particular, the fitness of an individual is equal to the *MRE* obtained by the deep neural network on the validation set, where the *MRE* is defined as:

$$MRE = \frac{1}{n} \sum\_{i=1}^{n} \frac{|\mathbf{Y}\_i - \hat{\mathbf{Y}}\_i|}{\mathbf{Y}\_i},\tag{2}$$

where *Ŷi* is the predicted value, *Yi* the observed value, and *n* is the number of data points.
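
Equation (2) translates directly into a short R helper; the vector names are illustrative:

```r
# MRE as in Equation (2); y_true and y_pred are hypothetical numeric vectors
# of observed and predicted values, respectively.
mre <- function(y_true, y_pred) {
  mean(abs(y_true - y_pred) / y_true)
}
```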

Several genetic operators are available and can be combined to find the best settings for the task at hand. After a set of preliminary experiments aimed at setting the GA's parameters, we used a tournament selection mechanism (with a tournament size of 3) and the BLX-*α* crossover (with *α* = 0.5), which combines two parents to generate offspring by sampling a new value in the range defined by the minimum and the maximum of the parents [51]. For mutation, we used the random mutation around the solution, which replaces the value of a gene with another value sampled around the current one.

The settings of the parameters used in the GA are reported in Table 3. The values shown are those that obtained the best performance in the preliminary runs, except for the population size: better results were achieved with a larger population, but the computational cost increases dramatically with the population size. In fact, a single run of the deep learning algorithm takes around 89.42 s for a number of layers between 2 and 100 and a number of neurons between 10 and 1000.

The execution of the GA with the deep learning algorithm as the fitness function and with the parameters defined in Table 3 takes around five days. If the population size is doubled, the execution can take more than one week. It is therefore feasible to increase one of the two parameters (population size or number of generations), but not both. Moreover, if the fitness of the best individual does not improve for 50 generations, the GA is stopped.

At the end of the execution, the best individual is returned and used to build the final deep learning network.


**Table 3.** Genetic algorithm (GA) parameter setting.
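
The following sketch shows how the search could be wired together with the GA package, under the assumption that discrete genes are recovered by rounding; `decode()`, the training helpers, and the bound and population values are illustrative placeholders (the layer and neuron ranges come from the text above, and the actual GA settings are those of Table 3).

```r
library(GA)

# All eight genes are encoded as reals; discrete and categorical parameters
# are recovered by rounding, as discussed above. Bounds other than layers
# and neurons are illustrative.
lower <- c(layers = 2,   neurons = 10,   lambda = 0, rho = 0.9,
           epsilon = 0, act = 1, dist = 1, metric = 1)
upper <- c(layers = 100, neurons = 1000, lambda = 1, rho = 1,
           epsilon = 1, act = 3, dist = 7, metric = 7)

fitness <- function(chrom) {
  params <- decode(chrom)              # round discrete genes, map categories
  net    <- train_h2o_network(params)  # h2o.deeplearning(), as sketched above
  # ga() maximizes, so the MRE on the validation set is negated.
  -mre(valid_actual, predict_network(net, valid_frame))
}

result <- ga(
  type      = "real-valued",
  fitness   = fitness,
  lower     = lower, upper = upper,
  selection = gareal_tourSelection,  # tournament selection (size 3 in the paper)
  crossover = gareal_blxCrossover,   # BLX-alpha crossover with alpha = 0.5
  mutation  = gareal_raMutation,     # random mutation around the solution
  popSize   = 50, maxiter = 1000,    # placeholders for the Table 3 values
  run       = 50                     # stop after 50 generations without improvement
)
best <- decode(result@solution[1, ]) # best individual -> final network parameters
```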

#### 3.2.3. Description of the Methodology

The main objective of this work is to predict the next *h* future values, called the prediction horizon, of a time-series [*x*1, *x*2, ..., *xt*].

The predictions are based on the *w* previous values, that is, on a historical data window. This process is called multi-step forecasting, as several consecutive values have to be predicted. The aim of multi-step forecasting is to induce a prediction model *f* (in our case, obtained by using a deep learning strategy) following the equation:

$$\left[\mathbf{x}\_{t+1}, \mathbf{x}\_{t+2}, \dots, \mathbf{x}\_{t+h}\right] = f\left(\mathbf{x}\_{t}, \mathbf{x}\_{t-1}, \dots, \mathbf{x}\_{t-(w-1)}\right). \tag{3}$$

Unfortunately, frameworks that provide deep learning models, such as H2O, do not support this multi-step formulation directly.

In order to solve this issue, a different methodology has been proposed [9]. The basic idea is to divide the main problem into *h* prediction sub-problems. Then a forecasting model will be induced for each of the sub-problems, as shown in Equation (4).

$$\begin{array}{rcl} \mathbf{x}\_{t+1} &=& f\_1(\mathbf{x}\_t, \mathbf{x}\_{t-1}, \dots, \mathbf{x}\_{t-(w-1)}) \\ \mathbf{x}\_{t+2} &=& f\_2(\mathbf{x}\_t, \mathbf{x}\_{t-1}, \dots, \mathbf{x}\_{t-(w-1)}) \\ \vdots & & \vdots \\ \mathbf{x}\_{t+(h-1)} &=& f\_{(h-1)}(\mathbf{x}\_t, \mathbf{x}\_{t-1}, \dots, \mathbf{x}\_{t-(w-1)}) \\ \mathbf{x}\_{t+h} &=& f\_h(\mathbf{x}\_t, \mathbf{x}\_{t-1}, \dots, \mathbf{x}\_{t-(w-1)}) \end{array} \tag{4}$$

Notice that in this way, we lose the time relationship between consecutive records of the time-series. For instance, instants *t* + 1, *t* + 2, *t* + 3 or *t* + 4 will not be considered when forecasting *t* + 5.

On the other hand, considering such values for the predictions could increase the forecasting error: since the values for *t* + 1, *t* + 2, *t* + 3 or *t* + 4 would themselves be predictions, any imprecision in their estimation would propagate to subsequent forecasts.

It follows that a search for the optimal parameters must be carried out for each sub-problem, where the evaluation of each individual corresponds to the error made by the neural network in the training phase. This means that the computational time needed to train the complete model is high; however, the capability of H2O to perform distributed computation reduces the total computational time required.
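
A minimal sketch of this direct multi-step scheme is given below; `build_training_frame()` and the frame names are hypothetical helpers, and *h* and *w* are set to illustrative values:

```r
# One model per horizon step, as in Equation (4), all sharing the same
# w lagged inputs.
h <- 24    # prediction horizon (illustrative)
w <- 168   # historical window (illustrative)

models <- vector("list", h)
for (k in seq_len(h)) {
  # Sub-problem k: target x_{t+k}, inputs x_t, ..., x_{t-(w-1)}.
  frame_k     <- build_training_frame(series, window = w, step_ahead = k)
  models[[k]] <- h2o.deeplearning(x = paste0("lag_", 1:w), y = "target",
                                  training_frame = frame_k)
}

# Each model f_k predicts its own step from the same last observed window.
forecasts <- sapply(models, function(m)
  as.vector(h2o.predict(m, last_window_frame)))
```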

## **4. Experimental Results**

In this section, we present the forecast results obtained on the dataset described in Section 3.1 by the strategy we propose. We also present a comparison with different methods, both standard and ML based.

In order to assess the predictions produced by our proposal, we used the *MRE* measure, as defined in Equation (2). *MRE* represents the ratio of the forecasting absolute error to the observed value.

Before presenting the comparison with other methods, we inspect the results obtained by the proposed strategy for each historical window value (*w*) and each sub-problem (*h*). Figure 3 shows the MRE obtained for different values of *w* as the length of *h* varies. The best results were achieved when the forecasting was based on more historical data, i.e., for higher values of *w*; in fact, the best results were obtained for *w* = 168. Analogously, the MRE increases as *h* becomes longer. The proposed strategy obtains similar results for *w* = {168, 144, 120} for all the considered values of the prediction horizon *h*.

**Figure 3.** Results obtained for each value of *h* and *w*.

It can be noticed that there is a significant increase in the error for smaller historical window sizes; in particular, when *w* is set to 24 or 48, the predictions degrade noticeably. We can also see that the performance of the proposed strategy deteriorates, i.e., the achieved MRE is higher, as *h* increases. This means that it is more difficult to predict further into the future.

Table 4 shows the parameters selected by the GA for each *h* when a historical window of 168 was used. The number of layers ranges between 27 and 98, and the number of neurons per layer between 478 and 942; neither parameter seems to be connected with the value of *h*.

Parameters *λ*, *ρ* and *ε* assume almost the same values in all the cases, while *Maxout* is the most frequently chosen activation function. The GA selected two distribution functions, namely the *Gaussian* and the *Huber* function. The selected end metric, on the other hand, presents more variation. This could suggest that some of the parameters, e.g., *ε*, could be fixed in order to reduce the search space.


**Table 4.** Parameters found by the GA for *w* = 168.

As previously stated, in order to globally assess the performance of our proposal, we compared the results achieved by our methodology (NDL) with those obtained by other strategies commonly used for time-series forecasting. In particular, we considered Random Forest (RF), Artificial Neural Networks (NN), Evolutionary Decision Trees (EV), the Auto-Regressive Integrated Moving Average (ARIMA), an algorithm based on Gradient Boosting (GBM), three deep learning models (FFNN, Feed-Forward Neural Network; CNN, Convolutional Neural Network; LSTM, Long Short-Term Memory), a decision tree algorithm (DT), and the ensemble strategy proposed in [14], which combines regression trees, artificial neural networks and random forests (ENSEMBLE).

For ARIMA, we used the tool in Ref. [52] to determine the order of the auto-regressive (AR) terms (*p*), the degree of differencing (*d*) and the order of the moving-average (MA) terms (*q*). The values obtained are *p* = 4, *d* = 1 and *q* = 3. The values of the auto-regressive parameter and the degree of differencing confirm that the time-series is not stationary, as indicated in Section 3.1.
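
Assuming the order-selection tool of Ref. [52] behaves like `forecast::auto.arima`, the ARIMA baseline could be reproduced as follows; the series name and the forecast horizon are illustrative:

```r
library(forecast)

fit <- auto.arima(demand_ts)            # automatic selection of p, d, q
# Equivalent fixed-order model with the reported ARIMA(4,1,3):
fit_fixed <- Arima(demand_ts, order = c(4, 1, 3))
preds <- forecast(fit_fixed, h = 24)    # 24-step-ahead forecast (illustrative h)
```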

The deep learning models were designed using the H2O framework for R [49]. The difference between NDL and DL is that, in the latter case, the network is trained with stochastic gradient descent using the back-propagation algorithm. In order to set the parameters for DL, we used a grid search approach. As a result, we used the hyperbolic tangent as the activation function, the number of hidden layers was set to 3 and the number of neurons to 30. The distribution function was set to Poisson and, in order to avoid overfitting, the regularization parameter (*λ*) was set to 0.001. The other two parameters (*ρ* and *ε*) were left at their defaults, as in [36].
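
A hedged sketch of such a grid search with `h2o.grid` is shown below; the candidate values and frame names are illustrative, not the exact grid we explored:

```r
# Hypothetical frame/column names; candidate values are illustrative.
grid <- h2o.grid(
  algorithm      = "deeplearning",
  x              = predictors,
  y              = "demand",
  training_frame = train_hex,
  hyper_params   = list(
    activation   = c("Tanh", "Rectifier", "Maxout"),
    hidden       = list(c(30, 30, 30), c(50, 50)),
    l1           = c(1e-3, 1e-4),
    distribution = c("poisson", "gaussian")
  )
)
# Models can be retrieved and ranked with h2o.getGrid(); the configuration
# reported above (Tanh, 3 x 30 neurons, Poisson, l1 = 0.001) was the winner.
```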

The DT algorithm is based on a greedy algorithm [53] that performs a recursive binary partitioning of the feature space in order to build a decision tree. This algorithm uses information gain to build the decision trees, and we used the default parameters of the *rpart* R package [54].

For GBM, we used the gbm R package [55] with a Gaussian distribution, 3000 gradient boosting iterations, a learning rate of 0.9 and 40 as the maximum depth of variable interactions.

For RF, we used the implementation provided by the randomForest R package [56], setting the number of trees built by the algorithm to 100 and the maximum number of terminal nodes per tree in the forest to 100.

For ANN, we used the nnet R package [57], with a maximum of 10 hidden units, a maximum of 10,000 weights and a maximum of 1000 iterations.
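
The four baseline configurations just described can be sketched together as follows, using the standard rpart, gbm, randomForest and nnet APIs; the data frame `train_df` and the formula are hypothetical:

```r
library(rpart); library(gbm); library(randomForest); library(nnet)

dt <- rpart(demand ~ ., data = train_df, method = "anova")  # default rpart settings [54]

gb <- gbm(demand ~ ., data = train_df, distribution = "gaussian",
          n.trees = 3000, shrinkage = 0.9, interaction.depth = 40)

rf <- randomForest(demand ~ ., data = train_df,
                   ntree = 100, maxnodes = 100)

ann <- nnet(demand ~ ., data = train_df, size = 10,         # 10 hidden units
            MaxNWts = 10000, maxit = 1000, linout = TRUE)   # regression output
```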

EV is an evolutionary algorithm for producing regression trees; we used the evtree R package (from now on EVTree) [58], with parameters as in [14].

The ensemble method [14] uses a two-layer strategy: in the first layer, random forests, neural networks and an evolutionary algorithm are used; the results produced by these three algorithms are then combined by an algorithm based on Gradient Boosting in order to produce the final prediction.
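
An illustrative sketch of this two-layer (stacking) scheme follows; all model objects and data frames are hypothetical, and the level-1 learners are assumed to be already trained:

```r
# Level-1 predictions from the three base learners become the features of a
# GBM meta-learner; rf, ann, ev and train_df are hypothetical objects.
level1 <- data.frame(
  rf_pred  = predict(rf,  train_df),
  ann_pred = as.numeric(predict(ann, train_df)),
  ev_pred  = predict(ev,  train_df),
  demand   = train_df$demand
)
meta <- gbm(demand ~ ., data = level1, distribution = "gaussian")
# At prediction time, the three level-1 models are applied to the new data
# and their outputs are passed through `meta` to obtain the final forecast.
```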

All the parameters of the ML based techniques were established after several preliminary runs.

Table 5 shows the results obtained by the various methods for each value of *w*. All the methods obtained better results with a historical window of 168 readings. NDL obtained the lowest MRE in all the cases, while the ensemble strategy obtained the second-best results. Moreover, NDL outperforms all the other methods even when a historical window of only 96 is used, confirming the strong performance of this strategy.

**Table 5.** Average results obtained by different methods for different historical window values. Standard deviation between brackets.


It is also interesting to notice that NDL obtains better results than DL for all the values of the historical window used, which confirms that using an evolutionary approach for optimizing the parameters of the deep learning network is a superior strategy with respect to grid search.

## **5. Conclusions and Future Works**

In this paper, we proposed a strategy based on neuroevolution in order to predict the short-term electric energy demand. In particular, we used a genetic algorithm in order to obtain the architecture of a deep feed-forward neural network provided by the H2O big data analysis framework. The resulting networks have been applied to a dataset registering the electric energy consumption in Spain over almost 10 years.

The results were compared with those of other standard and machine learning strategies for time-series forecasting. From the experimentation performed, we can conclude that the methodology proposed in this paper is effective for short-term electric energy forecasting; on the particular dataset used, the proposed strategy obtained the best performance. It is interesting to notice that our proposal outperforms the other ten strategies in all the cases and that, even when a historical window of 96 readings was used, it achieved more precise predictions than any other method with any other historical window size.

As future work, we intend to apply the framework proposed in this paper to other datasets, and also to other kinds of time-series, in order to check the validity of our proposal in other fields as well. Moreover, we intend to overcome a limitation of the current proposal: the GA R package we used does not allow parameters of different types, e.g., real and integer, to be optimized together. To work around this, we had to treat all the parameters as real, which causes the dimension of the search space to increase drastically. By solving this problem and thereby reducing the size of the search space, we are confident that better configurations of the deep learning network can be found. The use of on-line learning will also be explored in future works, in order to speed up the prediction process and reduce the volume of stored data.

**Author Contributions:** F.D. conceived and partially wrote the paper. J.F.T. launched the experimentation. M.G.-T. and F.M.-Á. addressed the reviewers' comments. A.T. validated the experiments. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** The authors would like to thank the Spanish Ministry of Science, Innovation and Universities for the support under project TIN2017-88209-C2-1-R. This work has also been partially supported by CONACYT-Paraguay through Research Grant PINV18-661.

**Conflicts of Interest:** The authors declare no conflict of interest.
