## *2.2. Methodology*

In this subsection, we describe the required data preprocessing steps and the fundamental concepts behind Temporal Convolutional Networks (TCNs).

#### 2.2.1. Data Preprocessing

In order to train a deep learning model that can predict several time steps ahead, a preprocessing stage is needed to transform the original time series data. First, we apply min-max normalisation to the entire sequence to scale the values between 0 and 1, which helps to improve the convergence of deep networks. Second, we transform the sequence into instances that can be used to feed the network. There exist several strategies to deal with multi-step forecasting problems [32]: the recursive strategy, which performs one-step predictions and feeds each result back as the last input for the next prediction; the direct strategy, which builds one model for each time step; and the multi-output approach, which outputs the complete forecasting horizon vector using a single model. As suggested in recent forecasting studies that use neural networks [33,34], in this work we adopt the MIMO (Multi-Input Multi-Output) strategy, which belongs to the last category. Instead of forecasting each time step independently, the MIMO approach can model the dependencies between the predicted values since it outputs the complete forecasting window. Furthermore, this strategy avoids the error accumulation over successive predictions that appears in the recursive strategy.
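
As a minimal sketch of the first preprocessing step (not the authors' exact implementation; function and variable names are ours), min-max normalisation and its inverse can be written in NumPy as follows:

```python
import numpy as np

def min_max_normalise(series):
    """Scale a 1-D time series to the [0, 1] range (min-max normalisation)."""
    series = np.asarray(series, dtype=float)
    s_min, s_max = series.min(), series.max()
    return (series - s_min) / (s_max - s_min), s_min, s_max

def min_max_denormalise(scaled, s_min, s_max):
    """Map normalised values (e.g., model predictions) back to the original scale."""
    return scaled * (s_max - s_min) + s_min
```

The minimum and maximum are kept so that the forecasts produced on the normalised scale can be mapped back to the original units.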

Following this approach, a moving window scheme is used to create the input–output pairs that are fed to the neural network. All deep learning models used in this study accept a fixed-length window as input and have an output dense layer with as many neurons as the forecasting horizon defined for each problem (24 for electricity demand and 48 for electric vehicle demand). Figure 3 illustrates the process of applying the moving window over the complete time series. As can be seen, the window slides over the series and produces an input–output instance at each position. While the output window size is defined by the problem, the input window size has to be decided. The optimal value can differ depending on the data, the designed model, and the forecasting horizon. In our study, we have experimented with three different input window sizes for each problem. The values have been carefully selected, considering the characteristics and seasonality of the datasets. For the electricity demand, we evaluate input windows of 144, 168, and 288 time steps (corresponding to 24, 28, and 48 h, respectively). For the power demand of electric vehicles, we consider input windows of 168, 336, and 672 time steps (corresponding to 7, 14, and 28 days, respectively).

**Figure 3.** Moving window procedure that obtains the input–output instances. In this example, the input and output windows have lengths of 7 and 3, respectively.
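
The moving window transformation illustrated in Figure 3 can be sketched as follows under the MIMO strategy (a NumPy sketch with hypothetical names, not the authors' code):

```python
import numpy as np

def moving_window(series, input_len, output_len):
    """Slide a window over the series and return (X, y) instance pairs.

    X has shape (n_instances, input_len) and y has shape
    (n_instances, output_len), so a single model outputs the whole
    forecasting horizon at once (MIMO strategy).
    """
    X, y = [], []
    for start in range(len(series) - input_len - output_len + 1):
        X.append(series[start:start + input_len])
        y.append(series[start + input_len:start + input_len + output_len])
    return np.array(X), np.array(y)

# Example matching Figure 3: input window of length 7, output window of length 3.
X, y = moving_window(np.arange(20, dtype=float), input_len=7, output_len=3)
```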

#### 2.2.2. Temporal Convolutional Neural Network

TCNs are a type of convolutional neural network with a specific design that makes them suitable for handling time series. TCNs satisfy two main principles: the network's output has the same length as the input sequence (as in LSTM networks), and they prevent leakage of information from the future to the past by using causal convolutions [24]. Causal convolution differs from standard convolution in that the operation performed to obtain the output at time $t$ does not take future values as inputs. This implies that, with a kernel of size $k$, the output $o_t$ is obtained using only the values $x_{t-(k-1)}, x_{t-(k-2)}, \ldots, x_{t-1}, x_t$ (Figure 4). Zero-padding of length $k-1$ is applied at every layer to maintain the same length as the input sequence.

(**a**) Standard convolution block with two layers with kernel size 3.

(**b**) Causal convolution block with two layers with kernel size 3.

(**c**) Dilated causal convolution block with two layers with kernel size 2, dilation rate 2.

**Figure 4.** Differences between (**a**) standard convolutional network, (**b**) causal convolutional network, and (**c**) dilated causal convolutional network.
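
As a minimal NumPy sketch of the causal convolution just described (illustrative only), the input is zero-padded on the left with $k-1$ values so that the output at time $t$ depends only on $x_{t-(k-1)}, \ldots, x_t$:

```python
import numpy as np

def causal_conv1d(x, w, b=0.0):
    """1-D causal convolution: output[t] depends only on x[t-k+1 .. t].

    Left zero-padding of length k-1 keeps the output as long as the input.
    """
    k = len(w)
    x_padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([np.dot(w, x_padded[t:t + k]) for t in range(len(x))]) + b
```

In practice, deep learning frameworks provide this behaviour directly; for instance, the Keras `Conv1D` layer supports `padding='causal'`.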

Furthermore, with the aim of capturing longer-term patterns, TCNs use one-dimensional dilated convolutions. This type of convolution increases the receptive field of the network without using pooling operations, so there is no loss of resolution [35]. Dilation with factor *d* consists of taking the inputs of the convolutional operation every *d* time steps (i.e., skipping *d* − 1 values between consecutive inputs), as can be seen in Figure 4c. The complete dilated causal convolution operation over consecutive layers can be formulated as follows [36]:

$$x_{l}^{t} = g \left( \sum_{k=0}^{K-1} w_{l}^{k} \, x_{l-1}^{t-(k \times d)} + b_{l} \right), \tag{1}$$

where $x_l^t$ is the output of the neuron at position $t$ in the $l$-th layer; $K$ is the width of the convolutional kernel; $w_l^k$ is the kernel weight at position $k$; $d$ is the dilation factor of the convolution; and $b_l$ is the bias term. Rectified Linear Unit (ReLU) layers are used as the activation function, $g(x) = \max(0, x)$ [37]. Another common approach to further increase the network's receptive field is to stack several TCN blocks, as can be seen in Figure 5 [38]. However, this leads to deeper architectures with many more parameters, which complicates the learning procedure. For this reason, a residual connection is added to the output of each TCN block. Residual connections were proposed in [39] to improve performance in very deep architectures, and consist of adding the input of a TCN block to its output, $o = g(x + \mathcal{F}(x))$.
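
A hedged Keras sketch of one residual TCN block combining Equation (1) with the residual connection $o = g(x + \mathcal{F}(x))$ is shown below; the layer sizes are illustrative, and the 1×1 convolution on the shortcut (used to match channel dimensions, as in standard residual designs) is an assumption rather than the authors' exact architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers

def tcn_block(x, filters, kernel_size, dilations):
    """One TCN block: stacked dilated causal convolutions plus a residual
    connection, o = g(x + F(x))."""
    shortcut = x
    for d in dilations:
        # padding='causal' applies a left zero-padding of (kernel_size - 1) * d,
        # so no future values leak into the output at any time step.
        x = layers.Conv1D(filters, kernel_size, dilation_rate=d,
                          padding='causal', activation='relu')(x)
    # Match channel dimensions before adding the residual connection.
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv1D(filters, 1, padding='same')(shortcut)
    return layers.Activation('relu')(layers.Add()([x, shortcut]))
```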

All these characteristics make TCNs a very suitable deep learning architecture for complex time series problems. The main advantage of TCNs is that, similarly to RNNs, they can handle variable-length inputs by sliding the one-dimensional causal convolutional kernel. Furthermore, TCNs are more memory efficient than recurrent networks thanks to their shared convolutional kernels, which allow long sequences to be processed in parallel. In RNNs, the input sequences are processed sequentially, which results in higher computation times. Moreover, TCNs are trained with the standard backpropagation algorithm, thus avoiding the gradient problems of the backpropagation-through-time (BPTT) algorithm used in RNNs [40].

**Figure 5.** Temporal Convolutional Network (TCN) model with 3 stacked blocks. Each block has 3 convolutional layers with kernel size 2 and dilations [1, 2, 4].
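
Building on the `tcn_block` sketch above, an architecture along the lines of Figure 5 could be assembled as follows; the filter count and the flatten-then-dense output head are placeholders for illustration, with the dense layer sized to the forecasting horizon as required by the MIMO strategy:

```python
def build_tcn_model(input_len, horizon, filters=32, kernel_size=2,
                    dilations=(1, 2, 4), n_blocks=3):
    """Stack several TCN blocks and end with a dense layer that outputs
    the whole forecasting horizon at once (MIMO strategy)."""
    inputs = layers.Input(shape=(input_len, 1))
    x = inputs
    for _ in range(n_blocks):
        x = tcn_block(x, filters, kernel_size, dilations)
    x = layers.Flatten()(x)
    outputs = layers.Dense(horizon)(x)
    return tf.keras.Model(inputs, outputs)

# Example with values from the text: a 168-step input window and a 24-step horizon.
model = build_tcn_model(input_len=168, horizon=24)
```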

## *2.3. Experimental Study*

In this subsection, we present the design of the experimental study carried out on the two energy-related datasets. We also describe the details of the parameter search process for each model architecture.
