3.4.1. Statistical Modeling Frameworks

To account for the non-linearity of the EAF process, a non-linear statistical modeling framework must be used. The numerical experiments in this study will use two such frameworks: ANN and RF. ANN has been used in a previous article applying the same methodology as described in this paper [2]. RF will be used to broaden the scope of models used to predict the EE consumption with the same methodology. Furthermore, SHAP interactions, an interpretable machine learning method that quantifies the interactions between the input variables with respect to the output variable, can be computed for RF models [20]. This is not the case for ANN models, for which only regular SHAP values can be used. SHAP is further explained in Section 3.5.1.

**Artificial Neural Networks:** This model framework uses a fully connected network of nodes to make predictions [21]. The first layer, which is known as the input layer, receives the values from the input variables. The values are then propagated through the intermediate layers, which are known as hidden layers, to the last layer. The last layer is the output layer where the prediction is made. See Figure 5 for an illustration of an arbitrary ANN model.

**Figure 5.** An Artificial Neural Network (ANN) for predicting an output value based on two input values [2]. It has one hidden layer with three nodes. The lines between the nodes illustrate that the ANN is fully connected and show the forward flow of calculations in the network.

By changing the number of nodes in the hidden layers and the number of hidden layers, one can alter the complexity of the model. Increasing the number of hidden layers and nodes increases the complexity and enables the model to learn more complicated relations between the input variables and the output variable.

In the output layer, and in each of the hidden layers, each node multiplies a weight value with each of the values propagated by the previous layer. The resulting weight–value products are then summed. Mathematically, this process can be expressed as:

$$s_j = \sum_{i=1}^{P} w_i \cdot x_i \tag{6}$$

where *j* is the *j*:th node in the current layer and *P* is the number of nodes in the preceding layer.

Each value, *s_j*, is then fed into an activation function, and the resulting value is propagated to the next layer in the network. Two commonly used activation functions are the hyperbolic tangent (tanh) and the logistic sigmoid.
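The weighted sum in Equation (6) followed by an activation function can be sketched in a few lines of NumPy. This is a minimal illustration of the forward pass through a network shaped like the one in Figure 5 (two inputs, one hidden layer of three nodes, one output); the weight values are random placeholders, not trained parameters.

```python
import numpy as np

def forward_layer(x, W, activation=np.tanh):
    # Each node j computes s_j = sum_i w_i * x_i (Equation (6)),
    # then feeds s_j into the activation function.
    s = W @ x
    return activation(s)

# Toy network matching Figure 5: 2 inputs, 3 hidden nodes, 1 output.
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(3, 2))  # weights: input layer -> hidden layer
W_out = rng.normal(size=(1, 3))     # weights: hidden layer -> output node

x = np.array([0.5, -1.0])                       # two input values
hidden = forward_layer(x, W_hidden)             # tanh activation in hidden layer
y = forward_layer(hidden, W_out, lambda s: s)   # linear output for regression
```

Note that the tanh activation bounds every hidden-node output to the interval (−1, 1) before it is propagated onward.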

During the training phase, the weights are updated in the direction that minimizes the overall loss of the predictions on the training data. Since the output variable is connected to the input variables by the network weights, it is possible to mathematically express the loss as a function of the network weights. Finding a good local minimum, with respect to the overall loss, in the weight space requires a sophisticated algorithm. These algorithms are known as gradient-descent algorithms, as their function is to descend toward an optimal local minimum in loss space [21].
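The weight-update loop described above can be sketched on the simplest possible "network", a single linear node, where the gradient of the mean-squared-error loss has a closed form. The data and learning rate below are illustrative choices; a real ANN backpropagates the gradient through every layer in the same spirit.

```python
import numpy as np

# Toy training set: the output is an exact linear function of the inputs.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w

w = np.zeros(2)   # initial weights
lr = 0.1          # learning rate: step size taken in weight space
for _ in range(200):
    # Gradient of the mean squared error loss with respect to the weights.
    grad = 2 * X.T @ (X @ w - y) / len(X)
    # Update the weights in the direction that decreases the loss.
    w -= lr * grad

loss = np.mean((X @ w - y) ** 2)
```

After the loop, the weights have descended close to the minimizing values and the training loss is near zero.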

Given enough hidden layers and nodes, an ANN can learn arbitrarily complex relationships between variables, including relationships that do not generalize and are therefore not valuable for prediction purposes. This overfitting phenomenon can be reduced by splitting the training data into two sets: the first set is used to adapt the weights, while the second set is used to calculate the loss after each weight update, so that training can be stopped once that loss no longer improves.
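As a hedged sketch of this overfitting countermeasure, scikit-learn's `MLPRegressor` can hold out a fraction of the training data (`validation_fraction`) and stop training when the loss on that held-out set stops improving (`early_stopping=True`). The data, layer sizes, and fractions below are illustrative, not the settings used in the study.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic regression data with a mild non-linearity and some noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
y = np.tanh(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=500)

model = MLPRegressor(
    hidden_layer_sizes=(16, 16),  # two hidden layers of 16 nodes each
    activation="tanh",            # hyperbolic tangent activation
    early_stopping=True,          # split off an internal validation set
    validation_fraction=0.2,      # 20% of the data used only for the loss check
    max_iter=2000,
    random_state=0,
).fit(X, y)
```

The validation set never influences the weight updates directly; it only decides when to stop them.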

**Random Forest:** This statistical modeling framework combines two or more decision trees into a single model. See Figure 6 for an illustration of a simple decision tree used for prediction. The RF model framework was first reported by L. Breiman [22]. RF belongs to the group of statistical models known as ensemble models: models made up of two or more sub-models that, when combined, aim to increase the prediction accuracy.

**Figure 6.** A simple decision tree sorting points in the {*x*, *y*, *z*} coordinate system. Points satisfying the condition at each node proceed along the left branch; the rest proceed along the right branch. The points *A* = {0, 42, 5}, *B* = {−10, −10, −10}, *C* = {31, 4, 0}, and *D* = {2, 4, 61} are sorted in the decision tree.

In an RF model, each decision tree is trained on a sub-sample of the complete training data set. This sub-sample is drawn *with replacement* from the complete training data set, a process known as bootstrapping. Training each decision tree on its own sub-sample reduces the overfitting of the RF model, since each decision tree becomes specialized on one segment of the sample space. Furthermore, the optimal split at each node in a decision tree is selected from a random selection of a pre-specified number of the available input variables. This procedure also reduces overfitting, since each decision tree now has a higher probability of being diverse with respect to the other trees in the model. Using many trees created from random selections of features and data points, RF models have been proven to converge such that overfitting does not become a problem [22].

In the prediction phase of an RF model, each decision tree predicts a value of the output variable. For a prediction of regression type, i.e., when predicting continuous values, the prediction of the RF model is the average of the predictions from all decision trees. The prediction of one data instance, *x_k*, by an arbitrary RF model is illustrated in Figure 7.
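The training and prediction mechanics above can be sketched with scikit-learn's `RandomForestRegressor`. In this illustrative example, each tree is fit on a bootstrap sample (`bootstrap=True`) and considers a random subset of the input variables at each split (`max_features`); the forest's regression prediction is the average of the individual tree predictions, which the last lines make explicit. The data and parameter values are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data with three input variables.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + X[:, 1]

forest = RandomForestRegressor(
    n_estimators=6,    # six decision trees, as in Figure 7
    max_features=2,    # random subset of 2 of the 3 inputs at each split
    bootstrap=True,    # draw each tree's sub-sample with replacement
    random_state=0,
).fit(X, y)

x_k = X[:1]  # one data instance
# Each decision tree predicts a value of the output variable ...
tree_preds = [tree.predict(x_k)[0] for tree in forest.estimators_]
# ... and the forest prediction is their average.
y_k = forest.predict(x_k)[0]
```

Averaging `tree_preds` by hand reproduces `y_k`, confirming the ensemble's averaging behavior for regression.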

To optimize an RF model for the task at hand, i.e., improve the accuracy, one must search for an optimal combination of hyper-parameters. The most important hyper-parameters are the maximum tree depth, i.e., number of splits from the root node, the number of decision trees, and the maximum number of features used to find the optimal condition when splitting a node [23].
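A search over the three hyper-parameters named above can be sketched with a cross-validated grid search. The grid values and the synthetic data here are illustrative assumptions, not the settings or data used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data with four input variables.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 4))
y = X[:, 0] * X[:, 1] + X[:, 2]

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={
        "max_depth": [3, 6, None],      # maximum tree depth (None = unlimited)
        "n_estimators": [50, 100],      # number of decision trees
        "max_features": [2, 4],         # features considered at each split
    },
    cv=3,                               # 3-fold cross-validation per combination
    scoring="neg_mean_squared_error",
).fit(X, y)

best = grid.best_params_  # hyper-parameter combination with the lowest CV error
```

Each of the 12 combinations is scored by cross-validation, and the combination with the lowest average error is retained.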

**Figure 7.** An arbitrary RF model consisting of 6 decision trees. The prediction, *y_k*, of the data instance, *x_k*, is determined by averaging the outputs of the decision trees. The filled nodes show the path *x_k* has taken in each decision tree.
