*3.2. Statistical Analysis—Developing the Model*

Because the training dataset consists of operating data from the first period after the building was commissioned, it may contain several irregularities. Detecting and excluding observations associated with irregular operation restricts the training dataset to fault-free operation, and a predictive model trained on this dataset should be able to provide accurate predictions.

An investigation of historical operating data from both the BAS and the internal control system of the air handling unit revealed a major change in operation. The consequence is illustrated in Figure 8, which depicts the thermal load of the pool heating system and shows a change in operation in late March 2018. The considerable change was caused by issues with the control of the integrated dehumidification system and the pool temperature, possibly a problem with a mixing valve. Since this flaw in the operation affects both the pool temperature and the heat recovery system, the whole period from 25 October 2017 to 22 March 2018 must be excluded from the training dataset.

**Figure 8.** Thermal load for the pool circuit, plotted against the timeline, in averaged 1-day time step resolution.
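The exclusion step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the record layout, the sample values and the function name `drop_irregular_period` are hypothetical. Treating 22 March 2018 as the first retained day is an assumption based on the stated start of the new training dataset.

```python
from datetime import date

# Exclusion window taken from the text: 25 October 2017 to 22 March 2018.
# Assumption: the window is half-open, so 22 March 2018 itself is kept.
EXCLUDE_START = date(2017, 10, 25)
EXCLUDE_END = date(2018, 3, 22)


def drop_irregular_period(records, start=EXCLUDE_START, end=EXCLUDE_END):
    """Return the (day, value) records dated outside the window [start, end)."""
    return [(day, value) for day, value in records if not (start <= day < end)]


# Hypothetical daily records: (date, mean thermal load in kW).
records = [
    (date(2018, 3, 20), 31.0),
    (date(2018, 3, 21), 30.5),
    (date(2018, 3, 22), 18.2),
    (date(2018, 3, 23), 17.9),
]

training = drop_irregular_period(records)
print([day.isoformat() for day, _ in training])
```

Filtering by date, rather than by a threshold on the load itself, keeps the rule transparent and avoids discarding legitimate low-load or high-load days after the fault was corrected.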

#### 3.2.1. New Training Dataset

The prediction model was developed after excluding the period of operational irregularities prior to 22 March 2018. The new training dataset, ranging from 22 March to 24 June 2018, consisted of three-day averaged values, for a total of 29 datapoints. An analysis of the autoregressive properties of the dataset showed no autocorrelation when the data were averaged over 72 h (3 days).
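As a rough illustration of the averaging and the autocorrelation check, the sketch below averages a daily series in non-overlapping three-day blocks and computes the lag-1 sample autocorrelation. The function names and input values are hypothetical. The paper does not state which autocorrelation diagnostic was used, so the lag-1 coefficient is an assumption chosen for simplicity.

```python
def block_average(values, block=3):
    """Average consecutive non-overlapping blocks; an incomplete tail is dropped."""
    n_blocks = len(values) // block
    return [
        sum(values[i * block:(i + 1) * block]) / block
        for i in range(n_blocks)
    ]


def lag1_autocorrelation(series):
    """Lag-1 sample autocorrelation of a series (values near 0 suggest none)."""
    mean = sum(series) / len(series)
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    num = sum(dev[i] * dev[i - 1] for i in range(1, len(dev)))
    return num / denom


# Hypothetical daily mean power values in kW.
daily = [12.0, 13.5, 12.9, 14.1, 13.0, 12.2, 15.3, 14.8, 13.9, 12.5]
three_day = block_average(daily, block=3)
print(len(three_day), lag1_autocorrelation(three_day))
```

Longer averaging windows reduce serial dependence between consecutive points but also reduce the number of datapoints available for fitting, which is the trade-off behind the 29-point dataset.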

The results of the regression analysis are expressed in Equation (3), and the key output is given in Table 2. To limit the risk of overfitting, 15 datapoints per predictor are recommended [52] to obtain a reliably fitted regression, which means a maximum of two predictors for a dataset of this size. The two independent variables found to explain most of the variance are the outdoor dry-bulb temperature (*T*<sub>out</sub>) and the pool usage factor (*t*<sub>pu</sub>) (see the description of variables in Table 1). Both variables have a statistical effect on the energy use, with nearly equal impact, and both are significant at the 5% level (*p* < 0.05). The chosen combination of variables is in accordance with the physics: the outdoor temperature represents the thermal losses through the envelope and ventilation, and the pool usage represents the water usage and the operation mode of the facility. The number of swimmers was not found to have a statistical effect on the overall power consumption, despite the impact of evaporation on the energy use. This may be explained by evaporation behaving like a step function, where a few bathers have a significant impact but a further increase gives only a small additional contribution to evaporation [53]. The combination of weather conditions and usage/occupancy has also been found to have a statistically significant effect on energy use in office buildings [38], despite the difference between these building categories.

$$\dot{E}_{tot} = 14.715 - 227.8\,T_{out} + 24.790\,t_{pu} \tag{3}$$

where *Ė*<sub>tot</sub> is the predicted power consumption [W], *T*<sub>out</sub> is the outdoor temperature [°C] and *t*<sub>pu</sub> is the pool usage factor.
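Equation (3) can be evaluated with a small helper such as the one below. This is a sketch under stated assumptions: the coefficients are copied exactly as printed, and it is not clear from the text whether the points in 14.715 and 24.790 are decimal marks or thousands separators, so the absolute scale of the output should be checked against Table 2 before use. The function name and the input values are hypothetical.

```python
# Coefficients of Equation (3), copied as printed in the text.
# Assumption: the scale of the intercept and the usage coefficient is
# uncertain (decimal mark vs. thousands separator); override if needed.
INTERCEPT = 14.715
COEF_T_OUT = -227.8
COEF_T_PU = 24.790


def predict_power(t_out, t_pu, intercept=INTERCEPT,
                  coef_t_out=COEF_T_OUT, coef_t_pu=COEF_T_PU):
    """Evaluate the linear model: intercept + coef_t_out*t_out + coef_t_pu*t_pu."""
    return intercept + coef_t_out * t_out + coef_t_pu * t_pu


# Sensitivity to a 1 degC rise in outdoor temperature at a fixed usage factor.
delta_per_degree = predict_power(1.0, 0.5) - predict_power(0.0, 0.5)
print(round(delta_per_degree, 1))
```

Because the model is linear, the change in predicted power per degree of outdoor temperature equals the temperature coefficient regardless of the usage factor, which makes the two effects easy to read off separately.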


**Table 2.** Key outputs from the regression analysis.

The model explains 87% of the variance in the training data (*R*<sup>2</sup> = 87%). The ability of the prediction model to reproduce the power consumption is illustrated in Figure 9, where the predicted power consumption is plotted along with the training data, the actual power consumption and the corresponding prediction interval. The 95% prediction interval is the range within which a new observation is expected to fall with 95% confidence. Its width depends on factors such as the sample size, the number of predictors and the significance level. For the range of independent variables covered by the training dataset, the mean prediction interval is ±1.86 kW. Figure 10 shows the linear relationship between the training dataset and the data produced by the prediction model, for which the Pearson correlation coefficient is 0.93.
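The two goodness-of-fit figures quoted above are related, and the sketch below shows how both can be computed from measured and predicted values. The function names and sample data are hypothetical. For an ordinary least-squares fit with an intercept, evaluated on its own training data, *R*<sup>2</sup> equals the squared Pearson correlation between measured and predicted values. With the reported coefficient of 0.93, this gives roughly 0.86, in line with the reported 87% once rounding is taken into account.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1 - ss_res / ss_tot


# Hypothetical measured and predicted power values in kW.
measured = [10.0, 12.0, 15.0, 11.0, 14.0]
predicted = [10.5, 11.5, 14.0, 11.5, 14.5]
print(pearson_r(measured, predicted), r_squared(measured, predicted))
```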

Regarding the fundamental assumptions of linear regression, the residuals from the training process, shown in Figures 11 and 12, are approximately normally distributed. There are no signs of heteroskedasticity, and the residuals have a mean value of approximately 0. The autoregressive process is not found to be of an order higher than 1, but the Durbin–Watson statistic is approximately 1.4, which may indicate some positive autocorrelation. However, neither the presence nor the absence of autocorrelation is found to be statistically significant. The regression equation is therefore considered reliable within the given goodness of fit.
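For reference, the Durbin–Watson statistic mentioned above can be computed directly from the ordered residuals, as in the following sketch. The function name and the residual values are hypothetical. The statistic ranges from 0 to 4, with values near 2 indicating no first-order autocorrelation. Using the common approximation d ≈ 2(1 − ρ), a value of about 1.4 corresponds to a lag-1 residual correlation of roughly 0.3.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by the sum of squared residuals."""
    num = sum(
        (residuals[i] - residuals[i - 1]) ** 2
        for i in range(1, len(residuals))
    )
    den = sum(e * e for e in residuals)
    return num / den


# Hypothetical residuals in time order (kW).
residuals = [0.4, 0.1, -0.3, -0.5, 0.2, 0.6, -0.1, -0.4]
d = durbin_watson(residuals)

# Approximate lag-1 autocorrelation implied by the statistic.
rho_approx = 1 - d / 2
print(round(d, 2), round(rho_approx, 2))
```

Whether a given value is significant depends on the tabulated lower and upper bounds for the sample size and number of predictors. Values falling between the bounds are inconclusive, which matches the cautious wording in the text.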

**Figure 9.** The predicted power consumption plotted against the training data and with the corresponding prediction interval.

**Figure 10.** The predicted power consumption plotted against the measured power consumption. The Pearson correlation coefficient is given as the R-coefficient.

**Figure 11.** The distribution of the residuals.

**Figure 12.** Residuals plotted against power consumption.
