1. Introduction
Modern households are increasingly dependent on electrical energy, having multiple appliances connected to the power grid. As a result, the increasing demand for energy has been a major concern for governments and society in general and has led not only to investments in renewable energy sources [1], but also to the development of smart and energy-efficient solutions, such as smart meters [2]. In this context, we are witnessing the emergence of renewable energy communities (RECs), whose goal is to democratise access to clean energy by enabling citizens and local businesses to produce their own energy and share the surplus with their neighbours, thus reducing the overall consumption costs. Leveraging trending technology areas such as artificial intelligence (AI) and specifically machine learning (ML), we are now able to track energy consumption and generation patterns, so that better decisions can be made regarding their use. In a REC setting, the accurate prediction of periods with higher costs or lower availability of renewable energy could trigger a shift in energy-intensive activities, such as charging an electric vehicle or running a washing machine, to other periods, which is desirable for all consumers from an economic and environmental perspective.
To tackle these use cases, we can apply ML models that are able to generalise to unseen data. Nevertheless, as demand and generation patterns are influenced by multiple factors, such as the weather, the time of the day, the day of the week, the season, and the presence of occupants, the data distribution is often non-stationary. Therefore, we need to ensure frequent retraining: that is, training and inference should be interspersed with a certain frequency to avoid performance degradation. In such a scenario, it is essential that the training time is as short as possible.
In the present study, we aim to enable faster model convergence, while preserving the privacy of individual buildings in the residential environment. We thus examine the effects of applying to specific buildings a general energy forecasting model trained on a common dataset, with and without fine-tuning on the local datasets. This approach is then contrasted with the use of a local model, trained exclusively on data from a single building. For this purpose, we used a synthetic dataset provided by the NeurIPS CityLearn Challenge 2023 [3], a competition that aims to promote the development of energy-efficient solutions for smart buildings. The dataset comprises hourly registers of energy loads, solar generation, and carbon intensity from six buildings over a period of 92 days. Our goal was to develop models that could predict these variables one hour ahead, and ultimately several hours ahead. However, by defining a multi-step memory window, the number of features greatly increases; hence, there is a need for a method to reduce the input of the models. In this study, we first apply principal component analysis (PCA) or a variational autoencoder (VAE) to compress the input and then test each option with three different models: an artificial neural network (ANN), a 1D convolutional neural network (1D-CNN), and a long short-term memory (LSTM) network.
The following sections describe the applied methodologies and the conducted experiments, from which we were able to draw some conclusions. Firstly, the global scenario presents a lower average error than the local scenario, but the latter outperforms the fine-tuning scenario, while drastically reducing the training time. Regarding the dimensionality reduction methods, PCA cannot capture the data’s complexity, given its linear nature. On the other hand, with VAE, we are able to decrease the test error by up to 63% and the training time by up to 80%, proving that it is possible to compress the relevant temporal information into a lower-dimensional space without noise.
2. State-of-the-Art
In the context of the ongoing global transition to a more sustainable energy system, the literature on developing energy forecasting models for smart grids is quite vast. In [4], the authors provided a comprehensive review of the advances in methods for forecasting electric energy consumption, comparing statistical methods, such as auto-regression and moving-average models and multiple linear regression, with AI-based methods, such as ANNs, support vector machines (SVMs) and gradient boosting (GB). Relying on previous work [5], they identified the main disadvantage of ANNs as the lack of prediction interpretability, in opposition to the other tested models. Yet, they also highlighted that ANN and multivariate models, such as SVM and multiple linear regression, tend to outperform univariate statistical models in terms of accuracy. Furthermore, in the context of smart and sustainable communities, besides understanding the reasoning behind the model predictions (transparency), there are some other crucial concerns about multivariate time-series forecasting addressed in the study [6]: the ability to reuse knowledge by transferring the model to different scenarios (transferability), the limited use of computational resources, and the reliability of the models when faced with abrupt changes in the data. Regarding the application of traditional ML models to multistep-ahead forecasting [7], there are two main approaches: a recursive way, where the model for a given time step $t_0$ of the forecast horizon is fed with the previous predictions at $t_{-1}$, $t_{-2}$, …, and a direct way, where we build a different model for each time step of the forecast horizon. The latter is more computationally expensive, but it exhibits less error accumulation, since the bias and variance are not propagated through the forecast horizon.
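To make the distinction concrete, the sketch below contrasts the two strategies for a univariate series, assuming scikit-learn-style one-step models; the function and variable names are hypothetical and the example is a simplification rather than the setup used in this study:

```python
# Illustrative sketch of recursive vs. direct multistep forecasting.
# The `model` objects are assumed to expose scikit-learn-style predict().
import numpy as np

def recursive_forecast(model, window: np.ndarray, horizon: int) -> list[float]:
    """One model; each prediction is fed back into the window as the newest lag."""
    window = window.copy()
    preds = []
    for _ in range(horizon):
        y_hat = float(model.predict(window.reshape(1, -1))[0])
        preds.append(y_hat)
        window = np.append(window[1:], y_hat)  # slide the window with the prediction
    return preds

def direct_forecast(models: list, window: np.ndarray) -> list[float]:
    """One dedicated model per step of the forecast horizon."""
    return [float(m.predict(window.reshape(1, -1))[0]) for m in models]
```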
Forecasting usually implies the creation of lag features that encapsulate the past behaviour of the target variables. In [8], the authors differentiated four types of memories (very short-term, short-term, medium-term, and long-term) and related them with the forecast horizon. For example, when predicting minutes to hours ahead, it is common to focus on very short-term memory, comprising the past 2–60 min, whereas when dealing with hours- to day-ahead predictions, we should focus on short-term memory, comprising the past 1–2 days. Harnessing their conclusions and an auto-correlation plot, we can obtain a sense of how many lags to consider. To speed up the training process, the literature recommends the use of dimensionality reduction techniques in forecasting tasks. A study on reduction techniques for forecasting models [9] employed principal component analysis (PCA) on a water quality dataset and concluded that the inter-correlation between the features was successfully eliminated, the computational time decreased, and the accuracy of the model increased. In contrast, the study [10] presents a supervised local maximum variance-preserving (SLMVP) method to forecast solar radiation and shows it outperforming PCA, reported as the currently most used reduction method in renewable energy forecasting. The major drawback of PCA is that it is not able to capture non-linear relationships between the features. To overcome this limitation, autoencoders (AEs) are also a solid alternative. There are a few applications of them in this area, as they help explain energy demand predictions by generating an interpretable latent space [11].
In a residential energy forecasting context, there are also privacy concerns that should not be overlooked. Energy patterns hold valuable information about the occupants' behaviour, which can be used to infer their daily routines, such as when they are at home, when they are sleeping, and when they are away. To deal with this issue, the literature suggests the use of global models, which are trained on data from multiple sources, instead of local models, which are trained on data from a single source. In [12], the authors argued that a global model tends to generalise better than a local model. They also demonstrated that there is always a global model that can match or exceed local models in terms of performance, regardless of the heterogeneity of the time series, but developing such a model is not an easy task. While local models become more complex as the dataset size increases, global models remain constant, so they can be surpassed by the former. To guarantee sufficient complexity in global models, three solutions are proposed: a memory increase (more lag features), the addition of high-degree polynomial features, and data partitioning, which is equivalent to reducing the number of samples for training.
Outside the scope of ML/AI, in general energy management systems, optimisation is also a major concern. In [13], a novel methodology is presented to enhance the post-disaster restoration of integrated power distribution and district heating systems (PDS/DHS). The paper introduces a coordinated maintenance and reconfiguration strategy that considers the complex interdependencies between these systems to optimise fault recovery. For that concrete purpose, the authors apply a two-stage acceleration algorithm that improves computational efficiency while maintaining solution accuracy, making it a viable approach for large-scale energy networks.
3. Scenario Description
To look into the notions of locality and globality, one should define what is considered global data (shared by all buildings) and local data (specific to each building).
Figure 1 illustrates the data partitioning required for scenario setup.
Half of the samples from each building are kept for global training, so that the global dataset results from the concatenation of the global partitions of buildings $1, 2, \ldots, B$ ($B = 6$, in this case). The remaining samples are meant to be used for local training and testing with a ratio of 70:30. As a result of the global/local data partitioning, we consider three different scenarios for residential energy forecasting (see Figure 2):
- (1): local training, testing with local data.
- (2): global training, testing with local data.
- (3): global training, fine-tuning with local data, testing with local data.
Regarding the scenario validation, which presupposes the variation of hyperparameters, the benchmark order is important to avoid redundant training. For instance, scenario (3) requires global training before local fine-tuning, so the global models resulting from (2) should be saved to disk to be fine-tuned later.
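As an illustration of this ordering, the sketch below trains a global model, saves it to disk, and later reloads it for local fine-tuning; the file name, data shapes, and hyperparameters are illustrative assumptions rather than the exact ones used in our experiments:

```python
# Sketch of the "train globally, save, fine-tune locally" workflow (scenarios 2 and 3).
# File names, shapes and hyperparameters are illustrative assumptions.
import numpy as np
import tensorflow as tf

n_features, horizon = 7, 1  # e.g., 5 latent variables + hour + day type
X_global = np.random.rand(1000, n_features).astype("float32")
y_global = np.random.rand(1000, horizon).astype("float32")
X_local = np.random.rand(300, n_features).astype("float32")
y_local = np.random.rand(300, horizon).astype("float32")

# Scenario (2): train the global model once and persist it to disk.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(horizon),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_global, y_global, epochs=5, verbose=0)
model.save("global_model.keras")

# Scenario (3): reload the stored global model and fine-tune it on local data,
# typically with a smaller learning rate and fewer epochs.
fine_tuned = tf.keras.models.load_model("global_model.keras")
fine_tuned.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
fine_tuned.fit(X_local, y_local, epochs=3, verbose=0)
```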
Regardless of the scenario, dimensionality reduction plays an important role in speeding up the convergence of the predictive model. Before feeding the latter with a lookback window of size $l$ (a predefined number of timestamps required to make a prediction), a reduction technique may be applied. As further discussed in Section 3.3, we test the absence of reduction against PCA and VAE. In Figure 3, where $n$ denotes the last timestamp before the prediction, we see how the mentioned methods are integrated into the model training pipeline. Originally, each sample from the provided dataset consists of the hour, the day type (day of the week) and $D$ features, but after lagging (see Section 3.3), the dimensionality grows to the order of $D \times l$. As depicted in the first stage of the pipeline, the temporal information is preserved, while the past values of all non-temporal features and the target are included. The reduction step is then applied, fixing the input size to the two preserved temporal features plus the $R$ reduced components. Finally, the model outputs the next $f$ values of the target variable.
All the scenarios presented relied on the TensorFlow library, and the validation experiments were conducted on a machine equipped with an NVIDIA GeForce RTX 2080 GPU (manufactured by Asus; sourced in Aveiro, Portugal), 8 CPU cores at 2.4 GHz, and 64 GB of RAM. The code was developed in Python 3.12 and the main imported libraries were Keras (3.6.0)/TensorFlow (2.16.1), Pandas (2.2.3), NumPy (1.26.4) and Scikit-learn (1.5.2).
3.1. Data Description
Considering that most electric utility providers charge in monthly cycles, it was essential to have at least one full cycle. With 2208 hourly registers (92 days) per building, we have three full cycles, where each sample contains 16 variables: 11 features and 5 targets. Nevertheless, when we focus on predicting one of the targets, the other four are considered as inputs, totalling 15 features, which enables our multivariate forecasting approach.
Some of the non-target variables contain temporal information (month, day of the week, hour) and others are related to the current building conditions: indoor temperature, HVAC mode (off/cooling/heating), temperature setpoint, indoor relative humidity, number of occupants, and average unmet cooling setpoint difference, that is, the difference between the indoor temperature and the cooling setpoint. There are two other variables (heating load and the status of the daylight saving mode), but we verified that both have a constant value for all buildings, so they were discarded.
Finally, the five target variables are all numerical, and three of them differ from building to building, while the other two are relative to the neighbourhood.
3.2. Exploratory Data Analysis
Figure 4 shows the time series for domestic hot water (relative to building 1) and carbon intensity. We first verified that the variables have very different scales, so we applied z-score normalisation to all the developed solutions. Plotting all the provided variables over time, we concluded that the building-level variables are very noisy. On the other hand, neighbourhood-level variables show clear patterns. We also noticed that both solar generation and carbon intensity have a daily seasonality, being higher during the day and lower during the night. However, the carbon intensity is more volatile, which is expected since it is influenced by the type of power generation sources.
When plotting the value distribution of the target variables, as presented in Figure 5, we observed that the non-shiftable load and domestic hot water demand have a right-skewed distribution with a long tail.
Note that the graph only shows data from building 1, but we concluded that all buildings have similar distributions. We also applied a filter to the non-shiftable load, setting all values above 2.3 to zero, which makes the distribution closer to a normal one. The same could be performed for the domestic hot water demand. For solar generation, we did not apply any transformation, because the pattern is very predictable.
Finally, we computed the auto-correlation of the target variables, as is common practice in state-of-the-art methods, and displayed graphics with blue-shaded regions representing the confidence intervals.
Figure 6 emphasises the difficulty of predicting non-shiftable load and domestic hot water demand, given their very low auto-correlation. For the other target variables, the relevance of the past 24 h is evident, so we used memory sizes that were multiples of 24 when creating lag features to feed the models. As was the case with the value distribution plots, we concluded that the auto-correlation is also similar across the various buildings.
3.3. Data Preprocessing
For each developed model, we define a lookback window, which is the number of hours that the model uses to make a prediction (24, 48, …), so the first step is to create lag features for each variable, using the `shift' method from Pandas or `roll' from NumPy. Then, we build the forecast horizon (e.g., 1 h) using the same methods in the opposite direction. For normalising the resulting features, we apply the `StandardScaler', which subtracts the mean and divides each feature by its standard deviation, scaling to unit variance.
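A minimal sketch of this preprocessing step is shown below, assuming a Pandas DataFrame `df' with one column per variable and a target column whose name is passed explicitly; the column names and window sizes are illustrative:

```python
# Sketch of lag-feature creation and scaling with Pandas/Scikit-learn.
# Column naming and window sizes are illustrative assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler

def build_supervised(df: pd.DataFrame, target: str, lookback: int = 24, horizon: int = 1):
    frames = {}
    # Lag features: past values of every variable (t-1 ... t-lookback).
    for col in df.columns:
        for lag in range(1, lookback + 1):
            frames[f"{col}_lag{lag}"] = df[col].shift(lag)
    # Forecast horizon: future values of the target (t+1 ... t+horizon).
    for step in range(1, horizon + 1):
        frames[f"{target}_t+{step}"] = df[target].shift(-step)
    supervised = pd.DataFrame(frames).dropna()
    X = supervised.filter(like="_lag")
    y = supervised.filter(like="_t+")
    return X, y

# Usage (df and the target name are assumptions):
# X, y = build_supervised(df, "non_shiftable_load")
# X_scaled = StandardScaler().fit_transform(X)   # z-score normalisation
```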
Regarding dimensionality reduction, we first tried PCA to obtain a smaller set of features that explain most of the variance in the data. We started by fitting the PCA after the creation of lag features and retrieving all the principal components. Computing the cumulative explained variance ratio, we automatically select K components such that 95% of the variance is retained and, as expected, the K value strongly depends on the size of the lookback and forecast windows.
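With Scikit-learn, this selection can be expressed directly by passing the desired variance ratio to `PCA'; the snippet below is a sketch with a synthetic placeholder standing in for the standardised, lagged feature matrix:

```python
# Sketch of variance-based component selection with Scikit-learn PCA.
# X_scaled stands in for the standardised, lagged feature matrix (shape is illustrative).
import numpy as np
from sklearn.decomposition import PCA

X_scaled = np.random.rand(2208, 360)

pca = PCA(n_components=0.95)          # keep K components explaining 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(pca.n_components_, round(pca.explained_variance_ratio_.sum(), 3))
```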
In Section 2, we point out AEs as an alternative to PCA. Even though they have a different logic behind them, we must bear in mind that the latent space of a simple linear AE is very similar to the resulting eigenspace in PCA. Adding non-linearity and depth to the AE would make it more capable of learning good representations of the data. Yet, whereas PCA ensures the orthogonality of the new features, AEs do not. In fact, the latent space of an AE is not guaranteed to be regularised, i.e., to comply with the ideas of continuity and completeness [14]. Continuity means that two close points (in terms of Euclidean distance) should not produce different contexts once decoded, and completeness means that a data point sampled from the resulting latent space should have meaning. To tackle this limitation, we apply a variational autoencoder (VAE). Instead of a fixed vector, the latent space is a pre-defined distribution (by default, Gaussian), defined by a mean $\mu$ and a standard deviation $\sigma$. From this distribution, expressed as $\mathcal{N}(\mu, \sigma^2)$, we draw a random variable $z$ (sampling) and the decoder maps $z$ back into the input space so that it is as close as possible to the original data. To ensure the aforementioned regularisation, one must enforce the encoded distribution to be close to a normal distribution with the Kullback–Leibler divergence term [15], which can be perceived as the number of bits required to convert one distribution into another. In parallel, we want the decoder output to be as similar as possible to the input data. In this sense, reconstruction on the last layer and regularisation on the latent layer form the loss function to be minimised.
Briefly, the main idea of the VAE is to introduce randomness and ultimately prevent overfitting. However, the sampling process is not differentiable and does not allow the backpropagation of the gradient through it. To overcome this, we apply the reparametrisation trick, adding an auxiliary variable $\epsilon$ to the latent space, so that $\epsilon \sim \mathcal{N}(0, 1)$ (standard normal distribution) and $z = \mu + \sigma \epsilon$. We keep the randomness in the model, but at the same time the sampling operation becomes deterministic.
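A minimal Keras sketch of such a sampling step is given below; it follows the textbook reparametrisation and is not necessarily the exact layer used in our implementation:

```python
# Sketch of the reparametrisation trick as a Keras layer:
# z = mu + sigma * epsilon, with epsilon drawn from a standard normal distribution.
import tensorflow as tf

class Sampling(tf.keras.layers.Layer):
    def call(self, inputs):
        z_mean, z_log_var = inputs
        epsilon = tf.random.normal(shape=tf.shape(z_mean))
        # exp(0.5 * log(sigma^2)) = sigma
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon
```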
For our concrete scenarios, we implemented a VAE composed of an encoder with two Dense layers, which maps the input data ($x$) to a latent space, and a symmetric decoder that maps the latent space to the reconstructed data ($\hat{x}$). The two parts are trained jointly by minimising the reconstruction loss, as presented in Equation (1), and the KL (Kullback–Leibler) divergence, as presented in Equation (2). The latter measures how much the latent distribution differs from a normal distribution. The total loss is the mean of the reconstruction loss and the KL divergence loss. As activation functions, we used the Swish function, which is a smooth and non-monotonic function that has been shown to outperform ReLU in some cases [16]. Since our intention is to generate a latent space, after training, we only keep the encoder. With this key change, we were able to reduce the dimensionality of the data from the order of thousands of features (after lagging) to fewer than ten, keeping only five latent variables plus the hour and day of the week, preventing the loss of relevant temporal information.
$n$: number of latent components;
$\sigma_i$: standard deviation of the latent distribution at index $i$;
$\mu_i$: mean of the encoded latent distribution at index $i$.
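For reference, the standard forms of the two loss terms, which Equations (1) and (2) are assumed to follow, are:

```latex
% Standard VAE loss terms, assumed to match Equations (1) and (2).
\begin{align}
  \mathcal{L}_{\mathrm{rec}} &= \frac{1}{N}\sum_{k=1}^{N} \left\lVert x_k - \hat{x}_k \right\rVert^2 \\
  \mathcal{L}_{\mathrm{KL}} &= -\frac{1}{2}\sum_{i=1}^{n}\left(1 + \log \sigma_i^{2} - \mu_i^{2} - \sigma_i^{2}\right)
\end{align}
```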
4. Model Selection
Faced with multiple target variables, one can either develop a single model to predict all of them or develop a separate model for each target variable. As previous studies [17] have suggested that target-specific modelling tends to be more efficient, we proceeded with the latter approach. Adopting it, we can use cooling demand as an input feature for a model designed to predict non-shiftable load, for example.
As mentioned in Section 2, purely statistical models with handcrafted features result in lower accuracy, despite their interpretability, and creating a recursive chain of ML models (linear regressions, decision trees, random forests, etc.) propagates the error through the forecast window when we extend it to several hours. Hence, we decided to implement three deep learning (DL) methods. The source code can be found in the following GitHub repository: https://github.com/RafaelGoncalvesUA/Accelerating-Residential-Energy-Forecasting (accessed on 11 February 2025). To facilitate the variation of hyperparameters during the benchmarking process, we developed a single Python `run.py' script that receives the arguments presented in Table 1.
4.1. Artificial Neural Network (ANN)
The first DL method was a simple ANN with an input layer, two hidden layers with ReLU as the activation function and a given number of units, and a linear output layer with the same number of units as the prediction horizon ($N$). Between the hidden layers, we added dropout regularisation. As illustrated in Figure 7, all the time steps of the lookback window are concatenated and fed into the input layer, creating a many-to-many architecture.
Representing the ANN model mathematically, the input layer can be expressed as shown in Equation (3). Since the ANN has more than one hidden layer, the transformation (application of weights, biases and the activation function) of the input vector is applied iteratively through each layer before computing the final output, as shown in Equation (4) (excluding the dropout regularisation for simplicity).
$X$: input vector containing the past $l$ time steps;
$W_1$, $b_1$: weights and biases of the first layer;
$\phi_1$: activation function of the first layer;
$h_1$: output vector that is fed into the next layer;
$Y$: output vector containing the next $f$ time steps;
$W_2$, $b_2$: weights and biases of the second layer;
$\phi_2$: activation function of the second layer.
Recall that each model developed was intended to predict only one of the target variables, so the outcome was 5 different models. For all of them, we fixed the number of units for the first and second hidden layers to 64 and 32, respectively. The dropout rate was varied between 10% and 20%.
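A minimal Keras sketch of this ANN is shown below; the input size is an illustrative assumption, while the unit counts and dropout placement follow the description above:

```python
# Sketch of the ANN architecture described above: two hidden ReLU layers (64 and 32 units),
# dropout between them, and a linear output with one unit per forecast step.
# The input size is an illustrative assumption.
import tensorflow as tf

def build_ann(n_inputs: int, horizon: int, dropout: float = 0.1) -> tf.keras.Model:
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_inputs,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(horizon, activation="linear"),
    ])

# model = build_ann(n_inputs=7, horizon=1)
# model.compile(optimizer="adam", loss="mse")
```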
For predicting 1 h ahead with a lookback window of 24 h, the ANN configurations (arguments and dimensionality reduction method) that performed best on average, considering all possible contexts (building, scenario, target variable), are presented in Table 2. Note that the target variables were normalised before averaging the values.
4.2. Convolutional Neural Network (CNN)
CNN differs from ANN in the way it processes the input data. While the ANN processes the input data as a whole, the CNN applies convolutions to small sub-sequences of size $c$ (see Figure 8). To avoid an even larger configuration space, we fixed $c$ to 3. The architecture was still many-to-many and consisted of an input layer, a 1D convolutional layer without padding, a dense layer whose number of units is determined by the forecast horizon ($f$) and the number of latent variables ($R$), and a reshape output layer. The advantage of the 1D convolution is that it captures local patterns along a single dimension. During validation, we varied the number of filters (32, 64) and the kernel size was set to $c$.
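A minimal Keras sketch of this CNN is shown below; the flattening step, the ReLU activation, and the single-target reshape are assumptions made to keep the example self-contained:

```python
# Sketch of the 1D-CNN architecture described above (Conv1D without padding, kernel size c = 3,
# dense layer, reshape output). The Flatten layer, activation and shapes are illustrative assumptions.
import tensorflow as tf

def build_cnn(lookback: int, n_features: int, horizon: int,
              filters: int = 64, c: int = 3) -> tf.keras.Model:
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(lookback, n_features)),
        tf.keras.layers.Conv1D(filters, kernel_size=c, padding="valid", activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(horizon),
        tf.keras.layers.Reshape((horizon, 1)),
    ])

# model = build_cnn(lookback=24, n_features=7, horizon=1)
```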
According to the described architecture, each time step predicted by the convolutional layer can be represented as shown in Equation (5). Then, for the dense layer, the operation is similar to the one described for the ANN model in Equation (4).
$i$: $i$-th time step in the considered window;
$c$: convolution kernel width;
$W_j$: $j$-th element of the 1D convolution kernel $W$ ($j$ ranges from 1 to $c$);
$b$: bias term;
$\phi$: activation function.
As Table 3 illustrates, for all possible contexts (building, scenario, target variable) of single-step predictions with a lookback window of 24 h, the CNN configurations that performed best on average were the ones using VAE as the dimensionality reduction method. There is also a slight advantage in using 64 filters instead of 128.
4.3. Long Short-Term Memory (LSTM) Network
As a recurrent model, the LSTM can capture long-term dependencies in the data and, therefore, learn better from sequences. As depicted in Figure 9, each LSTM unit acts as a memory cell with a one-to-one sub-architecture, where the input is fed into the cell and the output serves as the input for the next cell. The output of the last cell is the prediction.
Similarly to the previous model, the proposed network has a dense layer and a reshape output layer, but the convolutional layer is replaced by an LSTM layer using a hyperbolic tangent activation function and the following hyperparameters: (64, 128) LSTM units, (10%, 20%) dropout rate, and (10%, 20%) recurrent dropout rate.
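A minimal Keras sketch of this LSTM variant is shown below; the input shape and the single-target reshape are illustrative assumptions:

```python
# Sketch of the LSTM-based architecture described above (LSTM layer with tanh activation,
# dropout and recurrent dropout, dense layer, reshape output). Shapes are illustrative assumptions.
import tensorflow as tf

def build_lstm(lookback: int, n_features: int, horizon: int, units: int = 64,
               dropout: float = 0.1, recurrent_dropout: float = 0.1) -> tf.keras.Model:
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(lookback, n_features)),
        tf.keras.layers.LSTM(units, activation="tanh",
                             dropout=dropout, recurrent_dropout=recurrent_dropout),
        tf.keras.layers.Dense(horizon),
        tf.keras.layers.Reshape((horizon, 1)),
    ])

# model = build_lstm(lookback=24, n_features=7, horizon=1)
```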
To better clarify the architecture, the set of Equations (6)–(11) describes the logic behind the implemented LSTM model, including the input, forget, and output gates, as well as the cell state and the hidden state, where $x_{t-i}$ denotes the $i$-th last timestep in the lookback window.
$i$: $i$-th last element in the lookback window;
$h_t$: hidden state at time step $t$;
$c_t$: cell state at time step $t$;
$W$: weight matrices for the respective gates;
$b$: bias vectors for the respective gates;
$\sigma$: recurrent activation function;
$\tanh$: activation function;
⊙: element-wise multiplication.
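For reference, the standard LSTM updates, which Equations (6)–(11) are assumed to follow, are:

```latex
% Standard LSTM gate and state updates, assumed to match Equations (6)-(11).
\begin{align}
  f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
  i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
  o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
  \tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) \\
  c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
  h_t &= o_t \odot \tanh(c_t)
\end{align}
```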
Table 4 shows the best configurations for the LSTM model with the corresponding performance metrics averaged over all contexts (building, scenario, target variable). As mentioned, the lookback window was set to 24 h and the forecast horizon to 1 h. From the results, we conclude that using a higher dropout rate both for the input and recurrent layers leads to better performance. In addition, the exhibited training time is much higher than the one registered for ANN and CNN.
The increased computational cost can be attributed to the sequential nature of LSTMs, which makes it difficult to parallelise training, unlike CNNs, which can process multiple time steps simultaneously via convolutions. As further discussed in [18], during backpropagation, the error signals flowing through the recurrent connections tend to shrink, preventing the model from effectively learning correlations between temporally distant events (long-term dependencies). Therefore, the network may experience slow training, as was the case, or fail to converge, even if ideal training conditions are met and overfitting is tackled with dropout regularisation. Future work could explore alternative architectures, such as transformer-based models, which have been shown to effectively model long-term dependencies while enabling parallel processing. The downside is that they demand a large amount of data and computational resources to train effectively (see Section 6).
5. Benchmarks
For validating the proposed methods, we use a benchmark script (‘benchmark.py’) that calls the ‘run.py’ script (see Section 4) with different arguments. Apart from the hyperparameters indicated in the previous section, we varied the source of the training data (global or from a specific building), the training mode (only inference, training, or fine-tuning), and the dimensionality reduction technique (none, PCA, or VAE). Finally, the lookback window and the forecast horizon were set to 24 h and 1 h, respectively.
Each ‘run.py’ execution triggers the training of the selected model for each one of the five target variables, and the results are appended to a CSV file. If the same data source is passed as an argument when using VAE, the script will try to load an existing reduction model from disk. If the model is not found, it will be trained again and saved.
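A minimal sketch of such a benchmarking loop is given below; the flag names and values are hypothetical, and the actual arguments accepted by ‘run.py’ are the ones listed in Table 1:

```python
# Sketch of a grid-style benchmark loop driving run.py via subprocess.
# Flag names and values are hypothetical placeholders for the arguments in Table 1.
import itertools
import subprocess

models = ["ann", "cnn", "lstm"]
reductions = ["none", "pca", "vae"]
sources = ["global"] + [f"building_{i}" for i in range(1, 7)]

for model, reduction, source in itertools.product(models, reductions, sources):
    subprocess.run(
        ["python", "run.py",
         "--model", model,
         "--reduction", reduction,
         "--source", source,
         "--lookback", "24",
         "--horizon", "1"],
        check=True,
    )
```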
5.1. Scenario Comparison
After testing all combinations of hyperparameters mentioned in Section 4, we compared the three proposed scenarios in terms of mean squared error (MSE) and training time (see Table 5). We verified that, on average, the error is lower for the global scenario, followed by the local scenario, and finally the fine-tuning scenario. However, the training time follows the opposite order: the global scenario is the most computationally expensive, while the local one is the fastest. In this context, one must balance the trade-off between error and training time, depending on the specific requirements of the application.
5.2. Reduction Technique Comparison
A model that produces a low error but takes too much time to train is not suitable for a real-world scenario. In this context, dimensionality reduction is vital not only for dealing with the most irregular targets, but also for improving the training speed. In our benchmarks, we evaluate the models in terms of MSE, and also in terms of training time, using the expression in Equation (12).
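Assuming the usual relative-improvement formulation, Equation (12) can be read as:

```latex
% Assumed form of the relative improvement metric (Equation (12)):
% positive values indicate an improvement of the reduced model over the baseline.
\begin{equation}
  \mathrm{improvement}\,(\%) = \frac{m_{\mathrm{baseline}} - m_{\mathrm{reduced}}}{m_{\mathrm{baseline}}} \times 100
\end{equation}
```

where $m$ denotes either the MSE or the training time of the corresponding model.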
In Table 6, we present the average and maximum improvement in MSE for each reduction method. The VAE reduction method was able to reduce the error by more than 5% on average, and by 62.82% at most. On the other hand, PCA increased the error by 13.92% on average, despite reaching a maximum improvement of 8.09%. This result suggests that PCA might have failed to capture the variability of the data, owing to its linearity, or might have retained too much noise. Recall that we normalise the data before training; otherwise, there would be huge differences in the magnitude of the errors, e.g., the error for solar generation would be of a larger order of magnitude than the error for carbon intensity.
As for the training acceleration (see Table 7), the VAE reduction method was able to speed up the training by 48.57% on average, and by 79.91% at most. PCA also improved the training time, but to a lesser extent, with an average speedup of 24.03% and a maximum of 51.67%.
Since the scenarios described in Section 3 may require different models, we first found the best configuration for each scenario, taking into consideration all buildings and target variables. As shown in Table 8, for all scenarios, the best performer was the exact same configuration: a VAE reduction followed by a CNN model with 64 filters. With regard to training time, the 128-filter CNN model was able to converge faster, as Table 9 illustrates.
Then, we intended to find the best configuration regardless of the scenario, i.e., the one that performs best on average across all scenarios. Table 10 enumerates the top five, listing the average and the standard deviation. We observe that VAE occupies the first three positions and, as expected from the previous tables, the 64-filter CNN model is the best performer. The standard deviation shows that the error of a given model does not fluctuate dramatically between the three scenarios, but the difference still exists. In this context, the trade-off between error and training time can be explored to give preference to a scenario with a lower training time, even if it has a slightly higher error.
Regarding training time (see Table 11), CNN variants with distinct reduction methods dominate the ranking with similar results. Due to the previously described discrepancy in the average training time of the three scenarios, the statistical significance of their differences was not included in the table.
6. Limitations and Future Work
All models are fed with a lookback window. Assuming they will be re-trained on the fly, we need to consider that we do not have any observations at the start. Our proposal for a real application is to use a buffer dataframe to store as many observations as the lookback window size and smooth the cold start with an exponential moving average (EMA) by applying more weight to the most recent observations.
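A minimal sketch of such a warm-up buffer is given below, assuming hourly observations arriving one at a time; the smoothing factor and the padding strategy are illustrative assumptions:

```python
# Sketch of a cold-start buffer: until `lookback` real observations are available,
# the window is padded with an exponentially weighted moving average of what has been
# seen so far. The smoothing factor alpha is an illustrative assumption.
from collections import deque

class WarmStartBuffer:
    def __init__(self, lookback: int = 24, alpha: float = 0.3):
        self.lookback = lookback
        self.alpha = alpha
        self.buffer = deque(maxlen=lookback)
        self.ema = None

    def push(self, value: float) -> list[float]:
        # Update the EMA, giving more weight to the most recent observation.
        self.ema = value if self.ema is None else self.alpha * value + (1 - self.alpha) * self.ema
        self.buffer.append(value)
        # Pad the missing positions with the EMA until the buffer is full.
        padding = [self.ema] * (self.lookback - len(self.buffer))
        return padding + list(self.buffer)

# window = WarmStartBuffer().push(1.2)   # usable lookback window from the first observation
```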
Since seasonality is a factor that can significantly affect model generalisation, more data would benefit future applications of the presented solutions. Given the dataset's quarterly coverage (92 days), yearly trends are not captured. However, the demand for heating and cooling is highly dependent on the season: in the winter, the demand for heating is higher, and in the summer, the demand for cooling is higher.
Moreover, the implemented models are not directly comparable with others in the literature, due to the unique specifications of this use case. In this context, it is important to test a wide range of models and configurations to find the best solution for the specific problem. Despite their popularity, we emphasise that transformer-based models were not considered in this study, as they are computationally expensive and require a large amount of data to train. A separate study could be conducted to evaluate the performance of these models in the context of energy forecasting.
Regarding data privacy, the idea of using a global model for inference in a local context (with or without previous fine-tuning) is to keep sensitive data private, as energy registers reflect the consumption/production habits of the occupants. Despite the positive results of the study, future work should consider federated learning approaches, where the model is trained locally and only the weights are shared with the global model. This way, the privacy of the data is preserved, and the global model can be updated with the local models' weights.
As our dataset lacks some domain knowledge, future work should also consider climate forecasts and the routines of the building occupants, such as the time they wake up, go to bed, take a shower, or leave the building. Finally, a larger neighbourhood (more buildings) could also be advantageous. More buildings contribute to a more representative dataset, despite the distinct characteristics and occupants of each one.
7. Final Conclusions
The present study highlights the critical balance between the model performance and the computational burden reflected in the training time. Through a wide range of experiments, including the assessment of dimensionality reduction techniques and the exploration of the concepts of locality and globality in training, we first found that the models preceded by a variational autoencoder (VAE) consistently outperform the others in terms of mean squared error (MSE), while significantly enhancing training speed. Considering all scenarios, VAE exhibited a maximum MSE reduction of 63% and a maximum training acceleration of 80%.
The results also indicate that training a single model for all buildings in the neighbourhood yields the lowest error but demands considerable computational resources. Conversely, the local scenario provides a faster alternative, but at the cost of higher error rates. The last scenario, fine-tuning the global model, is even faster, but combining the distribution of the shared data with the building-specific data resulted in a higher error rate on average. Additionally, our findings underscore the importance of selecting appropriate lookback and forecast windows, revealing that a larger memory window can enhance model performance. A larger forecast window, on the other hand, contrary to what was expected, might benefit the MSE, because the final value is averaged over multiple steps. The next step might be hard to predict, but if there is a repeated pattern in the steps ahead, the model can learn it.
In summary, the choice of model configuration must be guided by the specific requirements of the application at hand. In the presented scenarios, we managed to surpass the baseline MSE and training time, applying the best reduction method. Therefore, we believe that a real energy management system could benefit from the proposed models, for example, integrating the predictions in the decision-making process of a reinforcement learning agent. Nevertheless, the reduced number of buildings and the short time coverage of the dataset are current limitations that should be addressed in future work with the integration of external data sources and the exploration of larger datasets with more buildings, in order to create more representative and generalised models.