#### 2.3.1. Models

The aim of this work is to build a TCN-based deep learning model that improves the performance of energy demand forecasting tasks in terms of both accuracy and efficiency. To assess the effectiveness of TCNs for this problem, we also evaluate the performance of recurrent LSTM networks, which have so far been considered the state of the art in time series forecasting. However, given the high complexity of deep learning models, finding optimal values for the hyperparameters of these networks is a very challenging task. Therefore, we have conducted an extensive experimental study that involves more than 1900 combinations of parameters that build different convolutional and recurrent architectures. An important hyperparameter common to both types of architectures is the size of the input window. The possible values for the past history were defined in Section 2.1 and depend on the data characteristics and seasonality. Additionally, we have searched for the best values of several parameters that are specific to TCN or LSTM architectures. In the case of TCNs, we have experimented with different numbers of filters and stacked residual blocks, kernel sizes, and dilation factors. In the case of LSTMs, we have experimented with different numbers of stacked layers and units. Furthermore, we have also studied the effect of training parameters on the performance of all models, such as the batch size and the number of epochs. The Adam optimiser, which has an adaptive learning rate that can improve the convergence speed of deep networks [41], has been selected for training the models. The mean absolute error (MAE) has been used as the loss function for all experiments.
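To make the scope of this search concrete, the sketch below encodes it as simple grids of candidate values over which all combinations are enumerated. The framework (Python with itertools) and the specific candidate values are illustrative placeholders; the grids actually explored are those listed in Tables 1 and 2, and only the optimiser (Adam) and loss (MAE) are taken directly from the text.

```python
import itertools

# Hypothetical candidate values; the grids actually explored are given in
# Tables 1 and 2 of the paper.
tcn_grid = {
    "past_history": [144, 288, 672],   # input window length (placeholder values)
    "nb_filters":   [32, 64, 128],     # convolutional filters per layer
    "kernel_size":  [3, 6],            # convolution kernel size
    "nb_stacks":    [1, 2, 3],         # stacked residual blocks
    "batch_size":   [64, 128, 256],
    "epochs":       [50, 100, 150],
}

lstm_grid = {
    "past_history": [144, 288, 672],
    "nb_layers":    [1, 2, 4],         # stacked LSTM layers
    "nb_units":     [32, 64, 128],     # units per LSTM layer
    "batch_size":   [64, 128, 256],
    "epochs":       [50, 100, 150],
}

def iter_configs(grid):
    """Yield one dictionary per combination of hyperparameter values."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

# Every candidate model is compiled with the Adam optimiser and the MAE loss:
#   model.compile(optimizer="adam", loss="mae")
```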

Table 1 displays all TCN architecture configurations that have been tested over both datasets. The parameter search process has been designed by considering the receptive field of the neurons inside the network, which can be calculated as follows: *receptive field* = *no. stacked blocks* × *kernel size* × *last dilation factor*. Depending on the length of the past history window, we carefully select possible values for the kernel size, stacked blocks, and dilations so that the receptive field covers the whole input sequence. For instance, if the number of stacked blocks increases, fewer dilated convolutional layers are needed, as can be seen in Table 1a,b. All these architectures are then tested with all combinations of the parameters displayed in Table 1c, which are common to both datasets (number of convolutional filters, epochs, and batch size). Overall, 756 experiments with different TCN configurations have been conducted for the electric demand dataset (28 models from Table 1a × 3 numbers of filters × 3 batch sizes × 3 numbers of epochs from Table 1c), and 513 for the EV power consumption data (19 models from Table 1b × 3 × 3 × 3 from Table 1c).
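A minimal sketch of the receptive-field rule used to prune the search space: a configuration is kept only if its receptive field spans the whole input window. The function names are illustrative; the numeric check reuses the best electric-demand configuration reported later in Section 3.1.

```python
def receptive_field(nb_stacks, kernel_size, dilations):
    """Receptive field as defined above:
    no. stacked blocks x kernel size x last dilation factor."""
    return nb_stacks * kernel_size * dilations[-1]

def covers_window(past_history, nb_stacks, kernel_size, dilations):
    """Keep a configuration only if its receptive field covers the input window."""
    return receptive_field(nb_stacks, kernel_size, dilations) >= past_history

# Example: 2 blocks, kernel size 6 and dilations up to 24 cover a 288-step window,
# since 2 * 6 * 24 = 288.
print(covers_window(288, 2, 6, [1, 3, 6, 12, 24]))   # True
```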

**Table 1.** Architecture configuration of all TCN models for each dataset and common parameter search.

**a** TCN architectures depending on the past history for the electric demand data. **b** TCN architectures depending on the past history for the EV power consumption data. **c** Common parameter search for both datasets.



Table 2 presents the parameters that have been studied for the LSTM networks. Given the sequential processing nature of these networks, they can cover the complete input sequence and capture long-term dependencies without constraints on the receptive field. Therefore, the parameter search in this case is based on trying different combinations of stacked LSTM layers and units, where each layer processes the sequence and feeds its output to the next stacked layer. We also consider the same possible values as above for the input window length. A total of 243 experiments with different LSTM models (3 past history values × 81 parameter combinations from Table 2) have been carried out for each dataset.
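For reference, a stacked LSTM forecaster of the kind searched over can be expressed in a few lines of Keras. The shapes, the dense output head, and the default layer/unit counts are assumptions for illustration; only the use of stacked LSTM layers, the Adam optimiser, and the MAE loss come from the text.

```python
import tensorflow as tf

def build_lstm(past_history, n_features, horizon, nb_layers=2, nb_units=128):
    """Stacked LSTM: every layer except the last returns the full sequence,
    so that its output can feed the next recurrent layer."""
    inputs = tf.keras.Input(shape=(past_history, n_features))
    x = inputs
    for i in range(nb_layers):
        x = tf.keras.layers.LSTM(nb_units,
                                 return_sequences=(i < nb_layers - 1))(x)
    outputs = tf.keras.layers.Dense(horizon)(x)  # one output per forecast step
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mae")  # Adam + MAE, as in the study
    return model

# Illustrative shapes: a 288-step input window and a 144-step forecasting horizon.
model = build_lstm(past_history=288, n_features=1, horizon=144)
```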



#### 2.3.2. Evaluation Metric

For evaluating the predictive performance of all models, we use the weighted absolute percentage error (WAPE). This metric has been suggested by recent studies dealing with energy demand data [31]. Electric utilities are interested in knowing the deviation in watts for better load generation planning. WAPE is therefore well suited to this context, since it is based on the mean absolute deviation, scaled by the mean demand so that errors remain easy to interpret. WAPE can be defined as follows:

$$WAPE(y, o) = \frac{MAE(y, o)}{mean(y)} = \frac{mean(|y - o|)}{mean(y)},\tag{2}$$

where *y* and *o* are two vectors with the real and predicted values, respectively, that have a length equal to the forecasting horizon.
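A direct NumPy translation of Equation (2), assuming *y* and *o* are one-dimensional arrays of length equal to the forecasting horizon.

```python
import numpy as np

def wape(y, o):
    """Weighted absolute percentage error (Equation (2)):
    mean absolute error divided by the mean of the real values."""
    y = np.asarray(y, dtype=float)
    o = np.asarray(o, dtype=float)
    return np.mean(np.abs(y - o)) / np.mean(y)

# Example: forecasts off by 1 unit on average, over a mean demand of 100.
print(wape([100, 101, 99, 100], [101, 100, 100, 99]))   # 0.01
```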

#### **3. Results and Discussion**

This section reports and discusses the results obtained from the experiments carried out with the different model architectures presented in the previous section. All tests were run on a computer with an Intel Core i7-7700K CPU and an NVIDIA GeForce GTX 1080 GPU with 8 GB of memory. The source code and the complete experimental results report can be found at [42].

## *3.1. Forecasting Accuracy*

Figure 6 shows a comparison between the overall performance of the TCN and LSTM models. It presents the distribution of the results obtained for each dataset with all architectures, grouped by the length of the past history window. In general, it can be seen that TCN models achieve better predictive accuracy than LSTM networks. In almost all cases, groups of different TCN architectures with the same input window show a smaller deviation. This implies that TCN models are less sensitive to the hyperparameter selection as long as the past history remains fixed. The most robust performance is given by TCN architectures using input windows of 288 time-steps for the electric demand data and 168 for the EV power demand data. Only in the case of the EV power consumption data with a very long past history (672 h) do TCN models struggle to achieve stable performance. Using longer input windows implies more trainable parameters and can complicate the learning procedure of very deep convolutional networks. With respect to the LSTM models, the worst results are obtained when using the largest history size. This suggests that the proposed recurrent networks are not able to process long input sequences efficiently and capture temporal dependencies better with smaller windows. The difference in performance between TCN and LSTM is larger for the electric demand dataset, which is also the longest time series. This indicates that acquiring high volumes of data is a fundamental step towards obtaining robust TCN-based deep learning models.

Table 3 presents the TCN and LSTM architectures that obtained the best WAPE result for each past history value. The highest accuracy for both datasets has been obtained with very similar TCN architectures. In the case of the electric demand data, the best result (0.0093 WAPE) has been obtained using 48 h (288 time-steps) as past history. This TCN architecture, which is represented in Figure 7a, has two residual blocks with five convolutional layers of kernel size 6, 128 filters, and dilations of 1, 3, 6, 12, and 24. This model has been trained over 50 epochs with a batch size of 128 instances. In contrast, the best LSTM model (0.0105 WAPE) for this dataset uses the smallest possible input window, which is 24 h. This LSTM model consists of 2 stacked layers with 128 units, and was trained over 50 epochs using a batch size of 64. In general, for this dataset, the best results have been provided by stacking two or three layers that use the largest number of filters or units (128). For TCNs, the largest kernel size (6) has proved to be the most effective for capturing local trend patterns in this long time series with daily and weekly seasonality. Furthermore, TCN blocks needed at least four convolutional layers with increasing dilation factors to achieve the highest accuracy. Both types of network achieve better results when trained for 50 epochs, which suggests that an excessive number of iterations may cause overfitting issues. Concerning the batch size, the optimal value is 128 in almost all cases. Deep networks lose generalisation capacity when trained with large batches, since they often converge to sharp minimisers [43]; hence, choosing smaller values can be beneficial.
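To make the reported architecture concrete, the sketch below approximates a TCN of this kind with plain Keras layers: residual blocks of dilated causal convolutions, stacked twice over dilations 1, 3, 6, 12, and 24 with kernel size 6 and 128 filters. It is a hedged reconstruction rather than the authors' original implementation; in particular, the dropout rate, the 1 × 1 skip projection, and the dense output head are assumptions.

```python
import tensorflow as tf

def residual_block(x, dilation, filters=128, kernel_size=6):
    """Two dilated causal convolutions with a residual (skip) connection."""
    skip = x
    for _ in range(2):
        x = tf.keras.layers.Conv1D(filters, kernel_size, padding="causal",
                                   dilation_rate=dilation, activation="relu")(x)
        x = tf.keras.layers.Dropout(0.1)(x)      # assumed regularisation
    if skip.shape[-1] != filters:                # match channel dimensions
        skip = tf.keras.layers.Conv1D(filters, 1)(skip)
    return tf.keras.layers.Add()([skip, x])

def build_tcn(past_history=288, n_features=1, horizon=144,
              nb_stacks=2, dilations=(1, 3, 6, 12, 24)):
    """TCN-style forecaster: stacked residual blocks with increasing dilations."""
    inputs = tf.keras.Input(shape=(past_history, n_features))
    x = inputs
    for _ in range(nb_stacks):
        for d in dilations:
            x = residual_block(x, d)
    x = tf.keras.layers.Lambda(lambda t: t[:, -1, :])(x)  # keep last time step
    outputs = tf.keras.layers.Dense(horizon)(x)            # multi-step forecast
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mae")
    return model
```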

**Figure 6.** Distribution of weighted absolute percentage error (WAPE) results for all architectures depending on the past history window. (**a**) Results for the electric demand data; (**b**) results for the EV power consumption data.

**Table 3.** WAPE results of the best TCN and Long Short-Term Memory (LSTM) models for each possible value of input window.


**a** Best results for the electric demand data.


**b** Best results for the EV power consumption data.

For the electric vehicle power consumption data, the best result (0.4228 WAPE) has been obtained with a past history of 14 days (336 time-steps). This TCN architecture, which is represented in Figure 7b, has two residual blocks with six convolutional layers of kernel size 3, 128 filters, and dilations of 1, 5, 7, 14, 28, and 56. This model has been trained over 100 epochs with a batch size of 64 instances. As for the previous dataset, the best LSTM model (0.4317 WAPE) uses the smallest possible input window, which is 7 days. This LSTM model consists of 2 stacked layers with 128 units and was trained over 50 epochs using a batch size of 128. Given the different nature of this dataset, the best architecture configurations present several differences. For TCNs, a smaller kernel was able to extract the underlying patterns more accurately. Furthermore, a smaller batch size was better in almost all cases, since the EV time series has fewer instances to train with.

(**a**) Best TCN architecture for the electric demand data.

(**b**) Best TCN architecture for the EV data.

**Figure 7.** Architecture of the best TCN model for each dataset.

In addition, Figure 8 shows the evolution of the training and validation loss for the best TCN and LSTM models. These plots help to compare the learning processes of the different architectures presented in Table 3. It can be seen that TCNs have a more stable loss optimisation procedure than the LSTM models, which can be attributed to the use of standard backpropagation instead of backpropagation through time. The training curves of the convolutional networks suffer fewer oscillations than those of the recurrent approach. For the electric demand dataset, TCN models converge faster to a lower validation loss. However, for the EV power demand data, LSTMs start from a lower validation loss at initialisation and hence converge more rapidly.

In general, the obtained results demonstrate that TCN models can outperform LSTM models in forecasting accuracy. TCNs have shown a more reliable performance regardless of the selected architecture and parametrisation. The dilated causal convolutions employed by TCNs were better at capturing long-term dependencies than recurrent units. Other important conclusions can be drawn from the experimental results of this study. As can be seen, the best LSTM models for both datasets use the smallest possible input window. This indicates that the sequential processing of recurrent networks is not optimal for dealing with very long input sequences. In contrast, the residual connections in TCNs make it possible to increase the depth of the network and effectively encode longer sequences. Furthermore, another aspect to consider is that the best performance was obtained when the stacked layers used as many filters or units as possible.

**Figure 8.** Evolution of the training and validation loss function of the best TCN and LSTM models for each possible value of past history. The first row corresponds to the electric demand data (**a**–**c**), while the second row corresponds to the EV power consumption data (**d**–**f**).

## *3.2. Computation Time*

The deep learning models proposed in this study have complex architectures that involve a high computational cost. Therefore, it is also essential to evaluate the models in terms of computational efficiency. Table 4 presents the training and prediction times for all the best TCNs and LSTMs presented in the previous subsection. It also reports the number of trainable parameters to further enrich the comparative study. For obtaining these computation times, the batch size has been fixed to 128 instances for all models in order to perform fair comparisons. As can be seen, TCNs have far more trainable parameters, since their stacked blocks contain many convolutional operations. This results in higher training times compared to the LSTM models. The differences in training times are slightly larger in the case of the electric demand data, considering that an epoch of this dataset involves many more instances than the EV data. However, with respect to prediction time, TCN models are always faster than the recurrent networks. Despite their considerably higher number of parameters, TCNs are able to provide very fast forecasts once they are trained. As expected, the parallel convolutions of TCNs can process long input sequences faster than the sequential processing of recurrent networks. This is an essential advantage for real-time applications, in which predictions need to be obtained as soon as possible in order to make informed decisions.
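As a rough illustration of how the two timings in Table 4 can be obtained, assuming a compiled Keras model and the training/validation arrays from the previous sketches: per-epoch training time is recorded with a callback, and per-instance prediction time is averaged over repeated single-sample calls. The callback and variable names are illustrative.

```python
import time
import numpy as np
import tensorflow as tf

class EpochTimer(tf.keras.callbacks.Callback):
    """Record the wall-clock duration of every training epoch."""
    def on_train_begin(self, logs=None):
        self.epoch_times = []
    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.perf_counter()
    def on_epoch_end(self, epoch, logs=None):
        self.epoch_times.append(time.perf_counter() - self._start)

# Assumes `model`, `x_train`, `y_train`, `x_val`, `y_val` are already defined.
timer = EpochTimer()
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          batch_size=128, epochs=50, callbacks=[timer])  # batch size fixed to 128
print("mean training time per epoch (s):", np.mean(timer.epoch_times))

# Prediction time for a single instance, averaged over repeated calls.
sample = x_val[:1]
start = time.perf_counter()
for _ in range(100):
    model.predict(sample, verbose=0)
print("mean prediction time per instance (s):",
      (time.perf_counter() - start) / 100)
```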

**Table 4.** Computation time of the best TCN and LSTM models for each dataset. Prediction time corresponds to the time that a model takes to compute a prediction for one instance. Training time corresponds to the time that a model takes to run one training epoch.


**a** Computation time for electric demand data.

**b** Computation time for EV power consumption data.

