3.1.2. Linear and Gaussian Regression

When making day-ahead energy demand and supply predictions, we are often faced with a single-input-variable system, and thus a simple linear regression (LR) model can be used for prediction purposes. Here, the model comprises an input (or predictor) variable that helps to predict the output variable, and the relationship is represented by a simple linear equation. More generally, the idea behind regression is to estimate from a sample the parameters of a model written as [31]

$$\hat{y} = \beta_0 + \beta_1 x_1 + \dots + \beta_N x_N + \varepsilon \tag{5}$$

where $\hat{y}$ is the predicted output, $\beta_0, \beta_1, \dots, \beta_N$ are the parameters (i.e., the model coefficients), $x_1, x_2, \dots, x_N$ are the input variables (or features), and $\varepsilon$ is a random error with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, where $\sigma^2$ is the variance. By determining these parameter values (i.e., the $\beta$ values), a line of best fit can be obtained and used for prediction purposes. The method of ordinary least squares can be used for parameter estimation; it involves minimizing the squared differences between the target and predicted outcomes (i.e., the residuals) [31]. The sum of squares of the errors, termed the residual sum of squares (RSS), is computed as $RSS = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$, which can also be minimized using, for example, the gradient descent algorithm instead of the ordinary least squares approach. The gradient descent algorithm commences with a set of initial parameter values and advances iteratively toward a set of parameter values that minimizes the function. The iterative minimization is accomplished via derivatives, i.e., by taking steps in the negative direction of the function gradient.
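The gradient descent procedure just described can be sketched as follows for the single-input case; the synthetic data, true coefficients, step size, and iteration count below are illustrative choices, not values from this study:

```python
import numpy as np

# Illustrative single-input example: fit yhat = b0 + b1*x by iteratively
# reducing RSS = sum_i (y_i - yhat_i)^2 with gradient descent.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 0.1, 50)   # assumed true model: b0=2.0, b1=0.5

b0, b1 = 0.0, 0.0   # initial parameter values
step = 0.01         # learning rate (step size)
for _ in range(5000):
    resid = y - (b0 + b1 * x)        # residuals y_i - yhat_i
    # Move each parameter in the negative direction of the RSS gradient
    # (gradients are averaged over the sample for numerical stability).
    b0 += step * 2 * resid.mean()
    b1 += step * 2 * (resid * x).mean()

print(b0, b1)   # estimates approach the assumed true values 2.0 and 0.5
```

Because the RSS is a convex function of the parameters, repeated steps along the negative gradient converge to the same solution the ordinary least squares method gives in closed form.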

However, in using the linear regression approach, it is essential that we make assumptions regarding the structure of the function to be used, for example, by choosing whether a linear or a quadratic function is the better fit. Such a choice can be made automatically by certain methods, such as the Gaussian regression (GR) (also called Gaussian process regression) approach [32]. Essentially, the GR generates a number of candidate functions that can model the observed data and attempts to find the function that best fits the data. This best-fit function is then used for predicting future occurrences. The main difference between GR and LR is that GR uses a kernel, which typically represents the covariance matrix of the data [33]. Thus, the choice of the kernel function often strongly influences the performance of the GR algorithm. Further theoretical details regarding the GR algorithm can be found in [34]. The hyperparameters of the LR and GR fine-tuned in this study include the attribute selection method, the kernel, and the filter type.
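To illustrate the role of the kernel, the following is a minimal Gaussian process regression sketch with a squared-exponential (RBF) kernel; the kernel choice, length scale, noise level, and toy data are all assumptions for illustration and do not reflect the configuration used in this study:

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance k(a, b) = exp(-(a - b)^2 / (2 l^2))."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_predict(x_train, y_train, x_test, noise=1e-2):
    """Posterior predictive mean and variance of a zero-mean GP."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_test, x_train)
    alpha = np.linalg.solve(K, y_train)   # K^{-1} y
    mean = K_s @ alpha                    # predictive mean k* K^{-1} y
    # Predictive variance (diagonal): k(x*, x*) - k* K^{-1} k*^T
    v = np.linalg.solve(K, K_s.T)
    var = 1.0 - np.sum(K_s * v.T, axis=1)
    return mean, var

# Toy data: noisy-free samples of a smooth function
x_tr = np.linspace(0, 2 * np.pi, 20)
y_tr = np.sin(x_tr)
mean, var = gp_predict(x_tr, y_tr, np.array([np.pi / 2]))
print(mean, var)   # mean close to sin(pi/2) = 1, with a small variance
```

The kernel here encodes the prior covariance between function values, so no fixed functional form (linear, quadratic, etc.) is imposed in advance; changing the kernel or its length scale changes the family of candidate functions the GR considers.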
