3.1.1. Artificial Neural Network

There are many works that describe the ANN [26–28], and we aim only to present its basic structure and how it works. The ANN, also called the multilayer perceptron (MLP), typically comprises an input layer, one or more hidden layers, and one or more output layers (depending on the specific application), with each layer containing a different number of nodes, as typically represented in Figure 1.

**Figure 1.** Representations of an ANN: (**a**) single-layer; (**b**) multi-layer.
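
As a minimal sketch (reading the structures in Figure 1 as a single hidden layer versus several hidden layers; the layer widths here are arbitrary assumptions, not values from the study), such architectures can be declared in scikit-learn as follows:

```python
# Minimal sketch of the two layer structures in Figure 1 with scikit-learn;
# the layer widths are illustrative assumptions, not values from the study.
from sklearn.neural_network import MLPRegressor

# (a) a single hidden layer with 10 nodes
single_hidden = MLPRegressor(hidden_layer_sizes=(10,), activation="logistic")

# (b) multiple (here, two) hidden layers with 10 nodes each
multi_hidden = MLPRegressor(hidden_layer_sizes=(10, 10), activation="logistic")
```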

The input data, which contain the independent variables (also called features or attributes), are denoted as $x_1, x_2, \ldots, x_n$, whereas the output (i.e., dependent) variable is denoted as $\tilde{y}$. The weights connecting the input and hidden nodes are denoted as $w_1, w_2, \ldots, w_n$. The ANN aims to minimize the error, which is the difference between the correct value $y$ and the predicted value $\tilde{y}$, via a cost function [26], wherein the term "cost" refers simply to this error. The steps taken by the ANN can be summarized as follows, with in-depth details available in [27,29]:

1. The dot product between the inputs and weights is computed. This involves multiplying each input by its corresponding weight and then summing the products together with a bias term $b$. This is obtained as

$$Z = \sum_{i=1}^{n} w_i x_i + b \tag{1}$$

2. The weighted sum $Z$ is passed through an activation function. A popular choice, which we used in our study, is the sigmoid activation function, which bounds the output between 0 and 1 and is stated mathematically as

$$\phi(Z) = \frac{1}{1 + e^{-Z}} \tag{2}$$

The sigmoid function returns values close to 1 for large positive inputs, values close to 0 for large negative inputs, and exactly 0.5 when the input is zero. It is well suited to predicting the output as a probability, which ranges between 0 and 1, making it a natural choice for our forecasting problem. The result of the activation function is essentially the predicted output for the input features.

3. Backpropagation is conducted by first calculating the cost via the cost function, which can simply be the mean squared error (MSE), given as

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\tilde{y}_i - y_i)^2 \tag{3}$$

where $y_i$ is the target output value, $\tilde{y}_i$ is the predicted output value, and $N$ is the number of observations (also called instances). The cost function is then minimized by fine-tuning the weights and the bias so that the function returns the smallest value possible; the smaller the cost, the more accurate the predictions. Minimization is conducted via the gradient descent algorithm, which can be mathematically represented as

$$W_x^* = W_x - a\left(\frac{\partial \mathrm{Error}}{\partial W_x}\right) \tag{4}$$

where $W_x^*$ is the new weight, $W_x$ is the old weight, $a$ is the learning rate, and $\frac{\partial \mathrm{Error}}{\partial W_x}$ is the derivative of the error with respect to the weight, with $\mathrm{Error}$ being the cost function. The learning rate determines how fast the algorithm learns. Gradient descent iterates repeatedly over the training data (each full pass is called an epoch) until the cost is minimized. In summary, the ANN repeats the forward pass (Steps 1 and 2) and backpropagation (Step 3) over many epochs until the cost converges, as sketched in the code below.
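
The following is a minimal NumPy sketch of Equations (1)–(4) for a single sigmoid neuron; the toy data, learning rate, and epoch count are illustrative assumptions, not values from the study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 observations with 3 features each (illustrative only)
X = rng.normal(size=(100, 3))
y = rng.uniform(size=100)          # targets in (0, 1) to match the sigmoid

# Initialize the weights w_i and bias b of Equation (1)
w = rng.normal(size=3)
b = 0.0
a = 0.1                            # learning rate
epochs = 1000

def sigmoid(z):
    # Equation (2): maps any input to the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(epochs):
    # Step 1: weighted sum of the inputs plus bias, Equation (1)
    Z = X @ w + b
    # Step 2: sigmoid activation, Equation (2)
    y_pred = sigmoid(Z)
    # Step 3: cost via the MSE, Equation (3)
    error = y_pred - y
    mse = np.mean(error ** 2)
    # Gradient descent update, Equation (4), via the chain rule:
    # dMSE/dZ_i = (2/N) * (y_pred_i - y_i) * sigmoid'(Z_i)
    grad_Z = 2.0 * error * y_pred * (1.0 - y_pred) / len(y)
    w = w - a * (X.T @ grad_Z)     # new weight = old weight - a * dError/dw
    b = b - a * grad_Z.sum()

print(f"MSE after {epochs} epochs: {mse:.4f}")
```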


Following the preceding steps, the ANN's hyperparameters can be fine-tuned using the GridSearchCV method, the use of which is well documented in [30]. The number of neurons, activation function, learning rate, momentum, batch size, and number of epochs are among the hyperparameters fine-tuned in our study.
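
As an illustrative sketch, these hyperparameters can be tuned with scikit-learn's GridSearchCV as below; the grid values are placeholders, not the grid used in the study:

```python
# A sketch of hyperparameter tuning with GridSearchCV; the grid values
# below are illustrative placeholders, not the grid used in the study.
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

param_grid = {
    "hidden_layer_sizes": [(10,), (50,), (10, 10)],  # number of neurons
    "activation": ["logistic", "relu"],              # activation function
    "learning_rate_init": [0.001, 0.01, 0.1],        # learning rate
    "momentum": [0.8, 0.9],                          # momentum (sgd solver)
    "batch_size": [16, 32],                          # batch size
    "max_iter": [200, 500],                          # caps the epochs
}

search = GridSearchCV(MLPRegressor(solver="sgd"), param_grid, cv=5)
# search.fit(X_train, y_train)   # X_train, y_train: the training data
# search.best_params_            # best hyperparameter combination after fitting
```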
