1. Introduction
Increasing societal awareness of environmental challenges has prompted a critical reassessment of our current energy paradigm. There is an urgent need for an energy transition, one that systematically reduces reliance on fossil fuels while enhancing the integration of renewable energy sources (RESs) into the electrical grid [1].
A clear trend in this context is the shift towards a more decentralised and distributed energy grid. In this model, energy production and consumption are localised, enabling solutions such as self-consumption (SC) or collective self-consumption (CSC). The widespread adoption of photovoltaic (PV) systems—facilitated by the feasibility of rooftop solar panel installations [2]—has further accelerated the deployment of SC. Nevertheless, the feasibility of SC depends on achieving an acceptable self-consumption ratio (SCR). To ensure this, energy management systems (EMSs) are usually implemented [3,4].
Within the SC framework, the aim of the EMS is usually to keep the consumption curve of the building following the production curve for as long as possible by, for example, acting on the flexible loads (FLs) of the building. A common FL used in demand-side management (DSM) or demand response (DR) is the heating, ventilation, and air-conditioning (HVAC) system [5]. HVAC systems offer a significant margin for energy efficiency improvements [6,7] and present a notable advantage over energy storage systems (ESSs), as they are already installed in most buildings, whereas batteries require additional investment and deployment. Despite this, numerous studies have proposed renewable energy-based grids incorporating ESSs as a means of energy balancing. For instance, ref. [8] presents an energy balancing methodology that utilises batteries while also integrating predictive models such as deep learning and support vector regression (SVR).
Recent research has increasingly focused on EMS strategies that incorporate predictive models, demonstrating that integrating future forecasts into control strategies enhances system performance [9,10]. As shown in [11], EMSs integrating predictions can improve optimisation results compared to an EMS that operates without predictions. In the context of SC, incorporating energy consumption and production forecasts into the EMS is especially valuable when the objective is to maximise SCR [12].
The accuracy of these forecasts, however, is influenced by numerous factors, including weather conditions, building occupancy levels, and user behaviour [13,14]. This can lead to prediction errors, which, as demonstrated in [15], influence the decision-making process of an EMS. The predictive models therefore need to be as robust and accurate as possible, as demonstrated in [4]; it is thus essential to properly select and design a predictive model capable of minimising prediction errors.
Different types of models have been proposed in the literature for prediction purposes, the most popular being data-driven predictive models [16]. Among these, two principal categories stand out: statistical analysis (SA)-based models—particularly autoregressive (AR) techniques—and machine learning (ML) methodologies.
SA models have been extensively evaluated in diverse forecasting applications, consistently yielding robust results. The Box–Jenkins method, which identifies, estimates, and diagnoses mainly autoregressive integrated moving average (ARIMA)-type models, increases the accuracy of predictions [17]. This is exemplified in [18], where the electricity price for the next 24 h is predicted through the Box–Jenkins method using data from the previous three days. Using a simple ARIMA model, the electricity price is predicted with a mean absolute percentage error (MAPE) of 3.55%. Seasonal autoregressive integrated moving average (SARIMA) models have also been widely used for energy consumption forecasting. This is the case in the work introduced in [19], where a SARIMA model identified by the Box–Jenkins method is used to perform short-term load forecasting (STLF) for a single university building. The model is able to forecast weekdays with an MAPE of 18.82%. However, SA models exhibit limitations when dealing with systems characterised by pronounced nonlinearities, restricting their applicability in complex scenarios.
Nonlinear predictive models are frequently employed for electricity consumption forecasting in buildings, as they effectively capture system nonlinearities [20]. In this sense, ML models, and more specifically neural networks (NNs), have gained relevance [16]. NNs can substantially enhance the performance of an EMS. This is evidenced by [21], where the authors implement an EMS in a building based on NN models that forecast the building’s consumption, weather conditions, and building comfort specifications. The model predictive control (MPC)-type EMS is implemented in simulation using EnergyPlus software and its performance is compared with a second MPC that does not include predictions as inputs. They conclude that the MPC fed with forecasts outperforms the conventional MPC.
Despite their advantages, complex NN architectures often require extensive datasets and computational resources for training. In [22], multilayer perceptron (MLP)-type networks are proposed with 12 inputs, between 24 and 48 neurons, and two or three hidden layers. This complex structure allows the network to predict building energy consumption with an MAPE of 1.71%, but requires 14 h of training. Similarly, the long short-term memory (LSTM) network presented in [23] employs 512 neurons, but delivers a suboptimal MAPE of 21.99%. Alternatively, in [24], a convolutional neural network (CNN) is combined with a double LSTM that comprises 10 layers and over 33,000 parameters. While the MAPE obtained is excellent, the training time is extremely high. This makes it difficult to regularly update the model, which is of great interest when there is a substantial change in, for example, weather conditions (a sudden increase in temperature) or in the pattern of the consumption curve when a high-consumption device is added.
A critical challenge in ML-based energy forecasting is the availability of long-term training datasets, particularly for newly constructed buildings equipped with smart meters but lacking historical data. In [25], an NN model trained on a three-day time window successfully forecasts small-scale loads within a building, achieving an MAPE of 7.05%. However, studies exploring ML models trained on limited datasets remain scarce. The ability to train models on small datasets offers the dual advantage of reduced computational cost and the flexibility for frequent retraining, enabling the model to rapidly adapt to dynamic conditions. In this regard, new types of NNs are emerging that are well suited to small-dataset scenarios. Examples are Kolmogorov–Arnold networks (KANs) [26] and liquid neural networks (LNNs) [27]. The former are presented as an alternative to MLPs and mainly differ in how learning is performed: while the neurons in MLPs are activated by fixed activation functions (sigmoid, linear, etc.), in KANs the activation functions are learnable (B-splines) [28]. As for LNNs, they stand out for their ability to adapt to noisy conditions and have a simple structure, which may make them suitable for applications where computationally low-cost predictions are required.
Moreover, while several studies in the literature explore ML-based approaches for predicting energy consumption in existing buildings, most rely on simulated data rather than real-world measurements. Consequently, the number of studies implementing ML models in fully operational systems remains limited.
To address this gap, this research proposes the design, implementation, and evaluation of two ML models deployed in a real system. These models perform day-ahead hourly consumption forecasting for a single building in real time, as required by the EMS. As the EMS needs to control the consumption of the HVAC system, the prediction of the building’s electricity consumption should not include the effect of the HVAC system. Therefore, the consumption corresponding to the HVAC system was automatically removed from the building’s total consumption curve.
Following the methodology outlined in [29], where three ML models were compared, this study evaluates the performance of two ML approaches: a nonlinear autoregressive model with exogenous inputs (NARX) and an SVR model. Comparing these two ML models is of interest, as such a comparison is rarely found in the literature on simple predictive models trained with limited data. To validate the effectiveness of the ML approaches, a benchmark model is also introduced for comparative analysis.
This study investigates the impact of training ML models, specifically NARX and SVR, with small datasets and how this affects their predictive performance. Unlike traditional approaches that rely on large-scale datasets, this research challenges the assumption that extensive data are necessary for effective model training. Additionally, it focuses on the development of real-time predictive models tailored for integration into an EMS operating in real-world conditions, ensuring that data acquisition and pre-processing can be performed in real time. A key aspect of the study is the analysis of recurrent terms in ML models for time series forecasting, particularly in the context of building energy consumption prediction. Since SVR lacks a recurrent component, this work introduces a novel approach by incorporating a time vector to compensate for this limitation, enhancing its ability to capture temporal dependencies.
The primary contributions of this work are threefold. First, it provides an evaluation of the impact of small-dataset training on NARX and SVR performance, offering insights that challenge the prevailing dependence on large datasets in ML-based energy forecasting. Second, a methodological framework for the development of real-time predictive models is proposed, addressing critical challenges of real-time data acquisition and data pre-processing within an operational EMS. Lastly, the study presents a novel approach for integrating temporal dependencies into SVR using a time vector, demonstrating its effectiveness in improving time series forecasting accuracy. These contributions collectively advance the field of real-time energy forecasting by enhancing model adaptability to limited data and improving the predictive capabilities of SVR in time-dependent scenarios.
The remainder of this paper is structured as follows. Section 2 presents the case study, to which the methodology introduced in Section 3 is applied. In Section 3, the detailed step-by-step design of the predictive models is introduced. Section 4 discusses the obtained results, and Section 5 summarises the main conclusions.
3. Methodology
Day-ahead forecasting of ESTIA 2 average hourly power was performed with the NARX and SVR models. With a sampling time of 1 h, the prediction of the following 24 h was carried out during one week of April. The difference in SVR model performance was analysed when a time vector was introduced as an additional input to the model.
Furthermore, the data used as input to the models were not measured data, but predicted data. Data acquisition and pre-processing were performed automatically (see Section 3.2). Both models, NARX and SVR, were designed, evaluated, and compared based on the minimisation of MAPE and R² (see Equations (2) and (3)). Likewise, both ML models were also compared with a benchmark model, the persistence model. The persistence model assumes that nothing changes between the present time and the future, meaning “tomorrow will be as today”; in a time series, the previous value is taken as the present value [30].
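For illustration, a minimal sketch of this persistence baseline is given below (Python; the function and the synthetic data are purely illustrative and not part of the deployed system):

```python
import numpy as np

def persistence_forecast(hourly_load: np.ndarray, horizon: int = 24) -> np.ndarray:
    """Day-ahead persistence baseline: 'tomorrow will be as today'.
    The forecast for the next `horizon` hours simply repeats the last
    `horizon` observed values of the consumption series."""
    return hourly_load[-horizon:].copy()

# Illustrative usage with synthetic hourly data
past_week = np.random.rand(7 * 24)       # one week of hourly consumption values
tomorrow_hat = persistence_forecast(past_week)
print(tomorrow_hat.shape)                # (24,)
```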
Figure 5 shows the steps followed to design the proposed ML models to carry out day-ahead forecasting of the consumption.
3.1. Evaluation Metrics
Widely referenced evaluation metrics in academic studies include the mean absolute error (MAE), the root mean square error (RMSE), and MAPE. The coefficient of determination (R²) is also commonly used to evaluate the performance of predictive models.
In this work, the MAPE and R² metrics were employed to evaluate model performance. MAPE represents the average deviation in absolute terms; it is therefore related to the average absolute error between the actual and predicted values.
MAPE evaluates the uniform forecast error in percentage [30]:

$$\mathrm{MAPE} = \frac{100}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \qquad (2)$$

where $y_i$ and $\hat{y}_i$ are the measured and corresponding predicted values, and $N$ is the total number of data points in the dataset considered for performance evaluation.
R², in contrast, is a metric related to the mean square error (MSE) and the variance. It therefore reflects values that deviate significantly from the real value, i.e., the outliers of the predictions. Equation (3) calculates the coefficient of determination:

$$R^2 = 1 - \frac{\mathrm{MSE}(y, \hat{y})}{\sigma_y^2} \qquad (3)$$

where $\mathrm{MSE}(y, \hat{y})$ is the mean square error between the measured and predicted values and $\sigma_y^2$ is the variance of the measured dataset. Likewise, as in Equation (2), $y_i$ and $\hat{y}_i$ are the measured and corresponding predicted values.
A value of 1 for R² means a perfect fit between measured and predicted values. An R² of zero means that there is no relationship between the dependent (output) and independent (input) values; in other words, the regression line between them is completely horizontal [31] (see Equation (4)).
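For reference, both metrics can be computed as in the following sketch (hypothetical helper functions, implementing Equation (2) directly and Equation (3) as 1 − MSE/variance, consistent with the definitions above):

```python
import numpy as np

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute percentage error, Equation (2)."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination, Equation (3): 1 - MSE / variance."""
    mse = np.mean((y_true - y_pred) ** 2)
    return 1.0 - mse / np.var(y_true)

# Illustrative values only
y_meas = np.array([10.0, 12.0, 11.0, 13.0])
y_hat = np.array([10.5, 11.5, 11.2, 12.4])
print(mape(y_meas, y_hat), r2(y_meas, y_hat))
```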
To assess the performance of the proposed prediction models over the analysed period (one week), it is essential to evaluate both the mean error of the models’ predictions and the occurrence of unusually extreme values or outliers. Both aspects are considered critical for evaluating the models’ effectiveness. Moreover, these two metrics provide intuitive values for comparison, distinguishing them from other commonly used metrics such as RMSE or MAE.
3.2. Automatic Data Pre-Processing
Robust data pre-processing is critical in ML, converting raw data into a structured format optimised for analysis and model training [32]. Real-world datasets frequently contain missing values, noise, and inconsistencies arising from measurement errors and variations in data collection methodologies. As noted, this study examined the ESTIA 2 building dataset, which includes the building load, the HVAC load from ten heat pumps, and meteorological data. Disparities in sampling rates, measurement units, and timestamp formats necessitate systematic data integration to ensure consistency and enhance forecasting accuracy.
The following pre-processing steps were performed automatically.
Data Quality Control: Missing values undergo treatment through deletion or imputation (linear interpolation, mean imputation), while time-based integrity checks identify and eliminate corrupted records.
Timestamp Standardisation: Timestamps in multiple formats (e.g., Unix time) are converted to ISO 8601 (UTC) to enable seamless dataset integration.
Time Series Aggregation: Given the hourly forecasting resolution, smart-meter time series are resampled using a rolling mean window of ±30 min, following prior work [29]:

$$\bar{x}_h = \frac{1}{N_h}\sum_{t \in T_h} x_t \qquad (5)$$

where $x_t$ represents the recorded measurement at time $t$, $T_h$ is the set of recorded timestamps within the rolling window for hour $h$, and $N_h$ is the total number of recorded measurements within that window.
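A possible implementation of this aggregation step is sketched below, assuming the raw smart-meter readings are held in a pandas Series indexed by timestamp (the function name and data layout are assumptions for illustration; the actual pipeline may differ):

```python
import pandas as pd

def resample_hourly(series: pd.Series) -> pd.Series:
    """Aggregate an irregularly sampled smart-meter series to hourly
    resolution using a centred +/-30 min mean around each full hour."""
    series = series.sort_index()
    hours = pd.date_range(series.index.min().ceil("60min"),
                          series.index.max().floor("60min"), freq="60min")
    values = []
    for h in hours:
        window = series[(series.index >= h - pd.Timedelta(minutes=30)) &
                        (series.index < h + pd.Timedelta(minutes=30))]
        values.append(window.mean())   # NaN if no readings fall in the window
    return pd.Series(values, index=hours)
```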
This rigorous pre-processing framework enhances data consistency, refines input quality, and improves predictive modelling accuracy.
3.3. Predictive Model Design
NARX neural networks, a subset of recurrent neural networks (RNNs), have been extensively employed in time series prediction due to their straightforward implementation and fast training process.
These networks are built upon an MLP architecture, comprising an input layer, a hidden layer, and an output layer, all interconnected through adjustable weights. The neurons in both the hidden and output layers are also associated with bias values [33]. During the training phase, the network iteratively adjusts these weights and biases to optimise their values, enhancing the relationship between the input and output data.
Equation (6) shows the input–output relationship using a NARX:

$$\hat{y}(t+1) = f\left(y(t), \ldots, y(t - d_y),\; x_1(t), \ldots, x_1(t - d_{x_1}),\; \ldots,\; x_p(t), \ldots, x_p(t - d_{x_p})\right) \qquad (6)$$

where $\hat{y}(t+1)$ is the future value of the target variable, $p$ is the total number of exogenous inputs, $d_{x_i}$ is the time lag of each exogenous input $x_i$, and $d_y$ is the time lag of the historical target values $y(t), \ldots, y(t - d_y)$.
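The lagged structure of Equation (6) can be illustrated with the following sketch, which builds the regressor matrix and fits a small MLP as a stand-in for the NARX network (the actual model in this work was implemented with Matlab’s Deep Learning Toolbox; the data here are synthetic and the lag values illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def build_lagged_matrix(y, X, d_y, d_x):
    """Arrange samples as in Equation (6): predict y(t+1) from
    y(t)...y(t-d_y) and each exogenous input x_i(t)...x_i(t-d_x)."""
    start = max(d_y, d_x)
    rows, targets = [], []
    for t in range(start, len(y) - 1):
        feats = list(y[t - d_y:t + 1])            # autoregressive terms
        for i in range(X.shape[1]):               # exogenous terms
            feats.extend(X[t - d_x:t + 1, i])
        rows.append(feats)
        targets.append(y[t + 1])
    return np.array(rows), np.array(targets)

rng = np.random.default_rng(0)
y = rng.random(21 * 24)                  # 21-day hourly training window
X = rng.random((21 * 24, 2))             # predicted temperature and occupancy
A, b = build_lagged_matrix(y, X, d_y=2, d_x=2)
model = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000,
                     random_state=0).fit(A, b)
```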
For the NARX model design, it is necessary to select the inputs that best describe the consumption behaviour, as well as the training window (TW), which defines the number of days used for model training. With regard to the selection of inputs and the TW, it was decided to use the results of the analysis carried out in [29]. In that work, the inputs to the model are selected by performing the prediction with all possible input combinations and seeking to minimise the MAPE. Accordingly, the predicted external temperature and occupancy were used as inputs; other possible combinations were discarded, as they did not provide better results. Moreover, a time vector was considered as an additional input to the SVR model, in order to compensate for the non-recurrent nature of this ML technique.
Regarding the TW, following the analysis carried out in [29], we chose to train the models with a TW of 21 days. In the mentioned work, predictions were carried out after training the model with different TWs (7, 14, or 28 days), and it was concluded that training the model with 21 days yielded the most accurate predictions.
Finally, the hyperparameters of the models were adjusted. The tuning of the NARX hyperparameters was carried out manually, following the criterion described below. The model performed day-ahead forecasts on a daily training basis. Since the weights of an NN model are randomly initialised each time it is trained, each candidate model was required to predict the entire selected week of April three consecutive times on a daily training basis. The average weekly MAPE over the three repetitions was then calculated. The process was repeated for each combination of hyperparameter values, i.e., for each model constructed, and the model with the lowest average MAPE was selected.
Regarding the NARX model, three different hyperparameters were adjusted to achieve the lowest MAPE: (i) the input and feedback delays, (ii) the number of neurons in the single hidden layer, and (iii) the activation function for the hidden and output layers. The simulation plan outlined in Table 2 was designed accordingly, combining all possible values for each of the hyperparameters. Each possible combination of the hyperparameter values defined in Table 2 constitutes an NARX model.
After carrying out the forecasts with all possible NARX models, the model able to forecast the selected week with the lowest mean MAPE was selected. The chosen NARX model hyperparameter values are given in Table 3.
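The selection procedure described above can be summarised with the sketch below; the grid values and the evaluation stub are illustrative placeholders (the values actually tested are those of Table 2, and the real evaluation trains the NARX daily over the selected week):

```python
import random
from itertools import product

delays = [1, 2, 3]                  # input and feedback delays (illustrative)
neurons = [5, 10, 20]               # neurons in the single hidden layer
activations = ["tansig", "logsig"]  # hidden/output activation functions

def weekly_mape(delay, n_neurons, activation):
    """Placeholder for the real evaluation: train the NARX on a daily basis
    and return the mean MAPE over the 7 day-ahead forecasts of the week.
    A random value is returned here so the sketch runs stand-alone."""
    return random.uniform(5, 25)

best, best_score = None, float("inf")
for combo in product(delays, neurons, activations):
    # Each combination is evaluated three times, since the network
    # weights are randomly initialised at every training run.
    score = sum(weekly_mape(*combo) for _ in range(3)) / 3
    if score < best_score:
        best, best_score = combo, score
print("Selected hyperparameter combination:", best)
```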
Support vector machines (SVMs) are a supervised learning approach commonly used for function estimation. While SVM is primarily applied to classification problems, it is also well suited for regression tasks. The regression variant of SVM is known as SVR, with the most widely used formulation being Vapnik’s ε-SVR [34]. For SVR with a radial basis kernel, three hyperparameters need to be determined: the penalty coefficient C, ε, and the kernel coefficient γ.
For the SVR model designed without the time vector as input, Bayesian optimisation [35] was employed to tune the hyperparameters; the resulting optimal values are given in Table 4.
For the SVR model that considers the time vector, the selected optimal hyperparameter values are given in Table 5.
Overfitting in the case of NARX was avoided thanks to the early stopping method of the Deep Learning Toolbox of Matlab [36]. To prevent the SVR models from overfitting or underfitting, cross-validation was applied while evaluating different model settings during training. Given the nature of the time series data, a specific implementation from scikit-learn, TimeSeriesSplit [37], was used. This method, a variation of k-fold cross-validation, ensures that the temporal structure of the data is preserved.
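A sketch of how such an SVR model could be set up is shown below, with the hour-of-day “time vector” added as a third input feature and TimeSeriesSplit used for cross-validation; note that a plain grid search is used here as a simpler stand-in for the Bayesian optimisation employed in this work, and all data and search ranges are illustrative:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
n = 21 * 24                                  # 21-day hourly training window
hour_of_day = np.tile(np.arange(24), 21)     # the "time vector" input
temperature = rng.normal(12, 4, n)           # predicted external temperature
occupancy = rng.integers(0, 2, n)            # predicted occupancy (0/1)
X = np.column_stack([temperature, occupancy, hour_of_day])
y = rng.random(n)                            # hourly consumption (placeholder)

param_grid = {                               # illustrative search space
    "svr__C": [1, 10, 100],
    "svr__epsilon": [0.01, 0.1],
    "svr__gamma": [0.01, 0.1, 1.0],
}
pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
search = GridSearchCV(pipe, param_grid, cv=TimeSeriesSplit(n_splits=5),
                      scoring="neg_mean_absolute_percentage_error")
search.fit(X, y)
print(search.best_params_)
```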
4. Results and Discussion
The forecasting results are presented in this section. In order to compare the NARX and SVR model performance, Table 6 and Table 7 record the MAPE and R² values obtained with the proposed ML models and the persistence model for each predicted day. A whole week was predicted.
Beginning with the comparison of the ML models against the reference model, it can be seen that the persistence model outperforms the SVR model that does not consider the time vector. However, it fails to improve on NARX on the vast majority of days, nor does it reach the accuracy of the SVR with the time vector. In general, the persistence model gives good results only when the curve does not vary from day to day. When predicting the consumption of a building, the consumption varies considerably, so the persistence model can hardly outperform the ML models.
Moreover, it is possible to observe a clear difference between the SVR results with and without the time vector. Considering the average MAPE of the week, the SVR with the time vector as input outperforms the SVR model without considering the mentioned input by 50.22%.
Figure 6 shows the consumption curves of NARX and SVR without the time vector compared with the real consumption during a week of April 2023, together with the persistence model. The persistence model clearly fails to represent the real curve, especially on Mondays and Saturdays, where there is a change in the consumption pattern. Likewise, the SVR that does not include the time vector input hardly follows the real consumption curve; its output closely resembles the occupancy curve of the building, which has two peaks during the day and a null value at night and at weekends.
The MAPE and R² outcomes clearly demonstrate that the SVR model that considers the time vector as input surpasses the NARX model on the majority of days. When examining the average MAPE achieved by the two models over the course of the week, SVR shows a 3.62% improvement compared to NARX. Although this difference in accuracy is not substantial, SVR proves to be superior in closely tracking the consumption pattern, as illustrated in Figure 7.
The NARX model evidently struggles to accurately forecast the consumption pattern during night hours. Looking at the curves at night, especially in the early mornings and the first hours of the night, it can be clearly observed that NARX suffers from sudden peaks. This may be due to the abrupt changes in the occupancy curve during early morning hours (from 0 to 1) and evenings (from 1 to 0). It can also be seen that the models perform worse at weekends.
On average, NARX and SVR with the time vector predict working days 37.64% and 28.27% better than weekends, respectively. There may be several reasons why the accuracy of the weekend consumption forecast drops. One may be a lack of sufficient weekend training data: when using a 21-day TW, only 6 weekend days are used to train the model. Likewise, the occupancy curve at weekends could be improved: during weekends, the occupancy is considered zero; however, some workers from start-ups work on Saturdays. This can result in the model not being able to accurately follow the weekend consumption pattern.
In order to study the calculated 3.62% improvement of the SVR with the time vector over the NARX model more thoroughly, a paired t-test and a Kolmogorov–Smirnov test were performed to assess whether this improvement was statistically significant.
Paired t-test: This test calculates the statistical difference between the means of the two curves using Equation (7):

$$t = \frac{\bar{d}}{s_d / \sqrt{n}} \qquad (7)$$

where $\bar{d}$ is the average of the value differences between the two compared curves, $s_d$ is the standard deviation of the differences, and $n$ is the number of samples.
The paired t-test yielded a p-value of 0.53, which exceeds the commonly used significance threshold of 0.05 (5%). This indicates that the null hypothesis cannot be rejected, suggesting that the mean differences between the two prediction curves are not statistically significant.
Kolmogorov–Smirnov test: In order to compare the full distributions of the predictions of the two models, and not only their means, the Kolmogorov–Smirnov test was performed. With a maximum cumulative difference (D) of 0.16 for a sample of n = 168 points and assuming the most commonly used significance level of 5%, the critical value according to the critical value table is 0.221. As 0.16 < 0.221, the null hypothesis cannot be rejected either, which suggests that there is no significant difference between the two distributions at the 5% significance level, i.e., the predictions of both models follow a similar statistical distribution.
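Both tests can be reproduced with standard SciPy routines, as sketched below on synthetic prediction arrays (168 hourly values per model, i.e., one week; the data are illustrative only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
narx_pred = rng.random(168)                          # stand-in NARX forecasts
svr_pred = narx_pred + rng.normal(0, 0.05, 168)      # stand-in SVR forecasts

# Paired t-test on the per-hour differences (Equation (7))
t_stat, p_paired = stats.ttest_rel(svr_pred, narx_pred)

# Two-sample Kolmogorov-Smirnov test comparing the full distributions
d_stat, p_ks = stats.ks_2samp(svr_pred, narx_pred)

print(f"paired t-test p-value: {p_paired:.2f}")
print(f"KS statistic D: {d_stat:.2f} (p-value {p_ks:.2f})")
```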
5. Conclusions
Two ML models, NARX and SVR, were designed and implemented in the framework of the study presented in this paper. Their aim was to perform day-ahead forecasting of the average hourly power consumption of the ESTIA 2 building without considering heat pump consumption, as specified by the EMS. Indeed, the EMS aims to control the building’s HVAC system to maximise the SCR.
Due to the limited data available, the proposed ML models were trained on a small dataset. Predicted external temperature and building occupancy were used as inputs to the model.
As the forecasting models operate in a real system, an automatic pre-processing step was added in order to automatically obtain the predicted external temperature data from the MG meteorological agency and apply pre-processing techniques to all the required data. The automated data acquisition and pre-processing make it much easier to introduce the prediction of, for example, a second building into the analysis; thus, the methodology is scalable to more buildings. Likewise, the proposed methodology allows the range of data to be extended, i.e., more data can be used for training the ML models if so desired.
The proposed ML models were compared with a benchmark model, i.e., the persistence model. As is evident, this model presents problems when the consumption pattern changes from one day to the next, which is why NARX outperforms the reference model nearly every day. The SVR that does not consider the time vector fails to represent the real consumption curve and fails to improve on the persistence model. Nevertheless, when predicting with the SVR model that considers a time vector as input, the results improve substantially: it manages to predict the consumption of the analysed week with an average MAPE of 10.73% and an R² of 0.85, which is 50.22% better than the SVR without the time vector. Unlike NARX, SVR lacks a recurrent term, which in NARX allows the dynamics of the system to be represented; including the time vector as input may have helped SVR to predict more accurately.
In general terms, it can be stated that for this case study and under these conditions, the SVR with the time vector is the model that outperforms the rest, despite being only 3.62% more accurate than the NARX model. However, the paired t-test and Kolmogorov–Smirnov test suggest that, regarding mean values and distributions, there is no evidence that the predictions made by the time vector SVR and the NARX model are statistically different.
Moreover, it was observed that by removing the heat pump consumption from the total consumption curve of the building, the consumption curve became substantially less variable, i.e., smoother. This probably made the prediction much simpler. While it has been found that NARX can be very effective in predicting curves with high variability [29], it may be that in this case, where the curve is flatter, SVR had an advantage over NARX.
It is also worth mentioning that in general, all the models show worse results when predicting weekend consumption, especially NARX and SVR without the time vector. It could be possible in future work to improve the occupancy curve in order to bring it more in line with reality. Since actual occupancy data for the ESTIA 2 building are not available, it could be tested in another case study where real occupancy data would be available on a day-to-day basis.
Likewise, the methodology was tested in only one month of the year. For this reason, we believe it is necessary to test the methodology in other periods in future work, in order to generalise the models and be able to draw more general conclusions.
Moreover, seeing the relevance that hybrid models are gaining for prediction purposes, in future work, it might be interesting to see whether or not a hybrid model based on, for example, an SVR with ensemble learning could outperform the simple SVR model.
In other future work, the hyperparameter adjustment process could also be optimised by applying techniques such as genetic algorithms (GAs).
Finally, it may also be of interest to extend the study by calculating the confidence intervals of the predictions.