*2.5. Statistical Methodology*

Multiple linear regression (MLR) was chosen for its strength as a statistical data-handling tool and for its simplicity in development, implementation and operation. This simplicity is crucial if building owners and the industry are to minimize, within a short period of time, the energy use related to undesired operation. In practical terms, the developers (the engineers) are familiar with the method from their university education, and operation management can easily evaluate the energy performance in a spreadsheet [41] or implement the simple algebraic equation in any reporting system.

The dataset was imported and analyzed with IBM SPSS statistical software [48].

#### 2.5.1. Multiple Linear Regression

The MLR method was used to predict the dependent variable *y*, here the total power consumption, averaged over a certain period. This period was taken to be sufficiently long that each time step could be treated as a steady-state process, so that the method captured only the physical effects. The regression equation was fitted by the ordinary least squares method, in which the sum of the squared errors is minimized. This yielded the regression coefficients: the intercept *β*<sub>0</sub> and the slope coefficients *β<sub>i</sub>* of the independent variables.

$$y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \varepsilon \tag{2}$$

where *y<sub>i</sub>* is the dependent variable, *β*<sub>0</sub> is the intercept with the *y*-axis (the value of *y* when all *x<sub>i</sub>* are zero), *β<sub>i</sub>* is the regression slope coefficient of the *i*th predictor, *x<sub>i</sub>* is the predictor (the independent variable) and *ε* is the error term.
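As an illustration of how Equation (2) can be fitted by ordinary least squares, the following is a minimal sketch in Python with NumPy. The analysis in this study was carried out in SPSS; the function names `fit_ols` and `predict` and the synthetic data are assumptions introduced here for illustration only.

```python
import numpy as np


def fit_ols(X, y):
    """Fit y = b0 + b1*x1 + ... + bp*xp by ordinary least squares.

    X: (n, p) array of predictors, y: (n,) array of responses.
    Returns the coefficient vector [b0, b1, ..., bp].
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that the first coefficient is the intercept.
    A = np.column_stack([np.ones(len(y)), X])
    # lstsq minimizes the sum of squared residuals ||y - A @ beta||^2.
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def predict(beta, X):
    """Evaluate the fitted linear equation for new predictor values."""
    X = np.asarray(X, dtype=float)
    return beta[0] + X @ beta[1:]


# Synthetic, noise-free illustration (not data from the study).
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]])
y_demo = 10.0 + 2.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1]
beta_demo = fit_ols(X_demo, y_demo)
```

Because the synthetic response is an exact linear function of the predictors, the fit recovers the intercept and slopes used to generate it; with measured data the residuals would be nonzero.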

#### 2.5.2. Assumptions

Several assumptions were adopted in developing the model. The data source was time series data and, initially, neither its autoregressive properties nor the order of the autoregressive process was known. These were identified by applying the partial autocorrelation function (PACF), which specifies the number of past lags influencing the dependent variable (i.e., the order of the autoregressive process). Applying the PACF in time series analysis is analogous to deciding the number of independent variables to include in a multiple linear regression analysis [49]. After this initial investigation, the dataset was reduced by averaging the data, centered in time, to eliminate any autoregressive properties in the dependent variable. Each observation in the training dataset was then treated as independent.
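The PACF check and the averaging step can be sketched as follows. This is one possible regression-based estimate of the PACF, not the SPSS procedure used in the study. The helper names, the non-overlapping block averaging and the approximate 95% bound of 1.96/√n are assumptions made for illustration.

```python
import numpy as np


def pacf_ols(series, max_lag):
    """Partial autocorrelations for lags 1..max_lag, estimated by regression.

    The lag-k value is the last coefficient of an OLS regression of the
    series on its first k lags (with an intercept).
    """
    x = np.asarray(series, dtype=float)
    n = len(x)
    values = []
    for k in range(1, max_lag + 1):
        # Column j holds the series shifted back by j + 1 steps.
        lags = np.column_stack([x[k - j - 1 : n - j - 1] for j in range(k)])
        A = np.column_stack([np.ones(n - k), lags])
        coef, *_ = np.linalg.lstsq(A, x[k:], rcond=None)
        values.append(coef[-1])
    return np.array(values)


def suggested_ar_order(series, max_lag):
    """Highest lag whose PACF exceeds the approximate 95% bound (0 if none)."""
    bound = 1.96 / np.sqrt(len(series))
    significant = np.flatnonzero(np.abs(pacf_ols(series, max_lag)) > bound)
    return int(significant[-1] + 1) if significant.size else 0


def block_average(series, block):
    """Average non-overlapping blocks; an incomplete trailing block is dropped."""
    x = np.asarray(series, dtype=float)
    n_blocks = len(x) // block
    return x[: n_blocks * block].reshape(n_blocks, block).mean(axis=1)


# Synthetic AR(1) series with coefficient 0.8 (not data from the study).
rng = np.random.default_rng(0)
noise = rng.normal(size=5000)
ar1 = np.zeros(5000)
for t in range(1, 5000):
    ar1[t] = 0.8 * ar1[t - 1] + noise[t]
pacf_demo = pacf_ols(ar1, max_lag=3)
```

For the synthetic series, the lag-1 value lies close to the generating coefficient, while higher lags are near zero. Averaging over blocks that are long relative to the identified order is one way to weaken the dependence between consecutive observations.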

#### 2.5.3. Evaluation of the Prediction Model

The "goodness of fit" was evaluated by the coefficient of determination *R*<sup>2</sup> and the adjusted *R*<sup>2</sup>, which accounts for the number of explanatory variables and thereby the possibility of overfitting. *R*<sup>2</sup> is defined as the ratio of the explained sum of squares to the total sum of squares.
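For reference, these two measures can be written out in their standard textbook form. The symbols ESS, RSS, TSS, *n* (number of observations) and *p* (number of independent variables) are notation introduced here and do not appear elsewhere in the text.

$$R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, \qquad \bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}$$

Here $\mathrm{ESS} = \sum_i (\hat{y}_i - \bar{y})^2$ is the explained sum of squares, $\mathrm{RSS} = \sum_i (y_i - \hat{y}_i)^2$ is the residual sum of squares and $\mathrm{TSS} = \sum_i (y_i - \bar{y})^2$ is the total sum of squares, with $\hat{y}_i$ the predicted value and $\bar{y}$ the mean of the observations. The equality of the two expressions for $R^2$ holds for a least squares fit that includes an intercept.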

The multiple linear regression equation was validated by an analysis of variance with the *F*-test. The test statistic, *F*, defined as the ratio between the explained and the residual sum of squares, each divided by its degrees of freedom, was compared against the *F*-distribution. A significance level of 5% was required.

The coefficients in the equation, which represent the impacts of the independent variables, were evaluated with the *t*-test. This test is similar to the *F*-test but assesses each coefficient individually: the test statistic, defined as the ratio of the coefficient to its standard error, is compared against the *t*-distribution to assess whether the coefficient differs significantly from zero.
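The two test statistics described above can be computed directly from a least squares fit. The sketch below uses NumPy only and is not the SPSS output of the study. The function name `regression_tests`, the returned dictionary keys and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def regression_tests(X, y):
    """Return the F statistic of the fit and the t statistic of each coefficient.

    X: (n, p) array of predictors, y: (n,) array of responses.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])  # intercept column first
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)

    resid = y - A @ beta
    rss = float(resid @ resid)                    # residual sum of squares
    tss = float(((y - y.mean()) ** 2).sum())      # total sum of squares
    ess = tss - rss                               # explained sum of squares
    df_resid = n - p - 1
    sigma2 = rss / df_resid                       # residual variance estimate

    # F: explained mean square divided by residual mean square.
    f_stat = (ess / p) / sigma2
    # t: each coefficient divided by its standard error.
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    t_stats = beta / std_err
    return {"beta": beta, "f_stat": f_stat, "t_stats": t_stats,
            "df_model": p, "df_resid": df_resid}


# Synthetic single-predictor example (not data from the study).
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(50, 1))
y_demo = 1.0 + 2.0 * x_demo[:, 0] + rng.normal(scale=0.5, size=50)
tests_demo = regression_tests(x_demo, y_demo)
```

With a single predictor, the *F* statistic equals the square of the slope's *t* statistic, which gives a quick consistency check. If SciPy is available, p-values at the 5% level could be obtained with, for example, `scipy.stats.f.sf(f_stat, df_model, df_resid)` and `2 * scipy.stats.t.sf(abs(t), df_resid)`.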

The fundamental assumptions of linear regression were investigated: a lack of multicollinearity, no heteroskedasticity, normally distributed residuals and no autocorrelation among the residuals [50]. These were fulfilled for each case in the presented analysis. Multicollinearity was investigated by manually inspecting the correlation matrix of the independent variables. Potential heteroskedasticity was evaluated visually. Autocorrelation among the residuals was tested with the Durbin–Watson statistic, which assumes a maximum lag of one. The lag of the residuals was investigated by applying the PACF to determine the order of the autoregressive process.
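Two of these checks lend themselves to a short numerical sketch: the correlation matrix of the independent variables and the Durbin–Watson statistic of the residuals. The function names below are hypothetical, and the study does not state a numerical threshold for the correlations, so none is assumed here.

```python
import numpy as np


def max_abs_correlation(X):
    """Largest absolute pairwise correlation between the columns of X."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    # Ignore the diagonal, where each variable correlates perfectly with itself.
    off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.max(np.abs(off_diagonal)))


def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))


# Small hand-checkable examples (not data from the study).
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
dw_constant = durbin_watson([1.0, 1.0, 1.0])
collinear = np.column_stack([np.arange(5.0), 2.0 * np.arange(5.0)])
r_collinear = max_abs_correlation(collinear)
```

The Durbin–Watson statistic ranges from 0 to 4. Values near 2 indicate no lag-one autocorrelation, values toward 0 indicate positive autocorrelation and values toward 4 indicate negative autocorrelation.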

#### 2.5.4. Validation

The prediction model was tested and validated by comparing the predictions with the measurements for the whole validation dataset. Validation was considered passed if (1) all measurements identified as normal operation were predicted within the prediction interval defined in the training process and (2) all operation disruptions were clearly identified by the validation process.
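The two criteria can be expressed as a simple check. The sketch below assumes that the lower and upper bounds of the prediction interval are already available for each observation, and that a disruption counts as "clearly identified" when its measurement falls outside that interval. Both the function name `validation_passed` and this interpretation are assumptions, not the study's stated procedure.

```python
import numpy as np


def validation_passed(measured, lower, upper, is_disruption):
    """Check both validation criteria on a labelled validation dataset.

    measured:      measured values of the dependent variable
    lower, upper:  bounds of the prediction interval for each observation
    is_disruption: True where the observation is a known operation disruption
    """
    measured = np.asarray(measured, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    is_disruption = np.asarray(is_disruption, dtype=bool)

    inside = (measured >= lower) & (measured <= upper)
    # Criterion (1): every normal-operation measurement lies inside the interval.
    normal_ok = bool(np.all(inside[~is_disruption]))
    # Criterion (2), as assumed here: every disruption lies outside the interval.
    disruptions_found = bool(np.all(~inside[is_disruption]))
    return normal_ok and disruptions_found


# Hypothetical values for illustration (not data from the study).
passed_demo = validation_passed(
    measured=[10.0, 11.0, 25.0],
    lower=[8.0, 9.0, 9.0],
    upper=[12.0, 13.0, 13.0],
    is_disruption=[False, False, True],
)
failed_demo = validation_passed(
    measured=[10.0, 20.0, 25.0],
    lower=[8.0, 9.0, 9.0],
    upper=[12.0, 13.0, 13.0],
    is_disruption=[False, False, True],
)
```

In the first example both criteria hold. In the second, a normal-operation measurement falls outside its interval, so the check fails.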
