Article

Nearly Zero-Energy Building Load Forecasts through the Competition of Four Machine Learning Techniques

1 School of Mechanical Engineering, Tongji University, Shanghai 201804, China
2 Institute of Building Environment and Energy, China Academy of Building Research, Beijing 100013, China
* Author to whom correspondence should be addressed.
Buildings 2024, 14(1), 147; https://doi.org/10.3390/buildings14010147
Submission received: 7 December 2023 / Revised: 3 January 2024 / Accepted: 4 January 2024 / Published: 7 January 2024
(This article belongs to the Special Issue AI and Data Analytics for Energy-Efficient and Healthy Buildings)

Abstract: Heating, ventilation and air conditioning (HVAC) systems account for approximately 50% of the total energy consumption in buildings. Advanced control and optimal operation, seen as key technologies in reducing the energy consumption of HVAC systems, indispensably rely on an accurate prediction of the building’s heating/cooling load. Therefore, the goal of this research is to develop a model capable of making such accurate predictions. To streamline the process, this study employs sensitivity and correlation analysis for feature selection, thereby eliminating redundant parameters, and addressing distortion problems caused by multicollinearity among input parameters. Four model identification methods including multivariate polynomial regression (MPR), support vector regression (SVR), multilayer perceptron (MLP), and extreme gradient boosting (XGBoost) are implemented in parallel to extract value from diverse building datasets. These models are trained and selected autonomously based on statistical performance criteria. The prediction models were deployed in a nearly zero-energy office building, and the impacts of feature selection, training set size, and real-world uncertainty factors were analyzed and compared. The results showed that feature selection considerably improved prediction accuracy while reducing model dimensionality. The research also recognized that prediction accuracy during model deployment can be influenced significantly by factors like personnel mobility during holidays and weather forecast uncertainties. Additionally, for nearly zero-energy buildings, the thermal inertia of the building itself can considerably impact prediction accuracy in certain scenarios.

1. Introduction

In the face of the rapid growth of global energy consumption, pressure on energy consumption in the construction industry is becoming increasingly prominent, with the sector projected to account for 50% of global energy consumption by 2050 [1]. To alleviate global warming, ecological degradation, and energy shortage, the international community is actively promoting various energy-saving and emission reduction measures. Nearly zero-energy buildings have become the target of energy saving in construction industries in major countries worldwide [2]. Due to HVAC systems accounting for nearly one-third of energy consumption in the construction industry, there is a keen focus on the improvement of energy efficiency in HVAC systems [3,4]. Many researchers have developed a series of methods for optimizing the operation of components in HVAC systems to improve efficiency [5,6]. However, it is crucial not to overlook that accurate prediction of heating/cooling loads in buildings is vital to ensure the effective implementation of these strategies [7,8,9].
Past load prediction research can primarily be categorized into three types: white-box, black-box, and gray-box methods [10]. The white-box models calculate building load through detailed physical equations, but due to the demand for comprehensive physics information of the building and massive computation, the applicability in scenarios like real-time optimal control is restricted. In recent years, the notable progress in machine learning has drawn attention to black-box models for building load prediction, because of their efficiency and flexibility [11].
Guo et al. [12] innovatively proposed a cooling load forecasting model based on multivariate linear regression (MLR). This model utilizes three notably effective calibration methods, presenting a good consistency between the predicted load and the actual load, with the average absolute error less than 8%. Huang et al. [13] successfully predicted the potential changes in the building’s heating and cooling load with the aid of wavelet neural network technology. They introduced an improved ant colony optimization algorithm to optimize the network parameters, significantly enhancing the accuracy of the model. Liang et al. [14] proposed a novel optimization method specifically for dealing with the hyperparameter problems of multilayer perceptron (MLP). Compared to other optimization schemes, their model demonstrated the most robust performance in predicting air conditioning loads. Tao et al. [15] established a load forecasting model using support vector regression (SVR), introducing an enhanced simulated annealing algorithm to optimize the model parameters, effectively reducing the uncertainty of model learning parameters.
Although data-driven models have achieved satisfactory results in the field of load prediction, some limitations remain. Most case studies are located in a single city or climatic region, which calls into question the adaptability of the modeling techniques to different environments. In addition, the single models employed in these cases are limited in flexibility and comprehensiveness when forecasting loads; hence, composite models are receiving increasing attention. Moreover, some studies have not taken practical engineering needs into account and remain in the simulation phase. In reality, given changes in weather, the randomness of internal loads, and the uncertainties of occupant behavior, predictive models inevitably contain uncertainty. Therefore, it remains to be verified whether the performance of models derived from simulation environments can meet expectations when applied to real buildings.
Further, each building has unique load characteristics due to differences in function, location, and economic conditions. Input choices that perform well in specific buildings may not have the same performance in others. Until the emergence of a comprehensive, inclusive, and universally accepted dataset, it is necessary to independently select input parameters for different buildings through feature selection. Feature selection can identify the features most closely related to building load for model training [16]. Rational feature selection is crucial to the modeling process, helping to eliminate redundant parameters, reduce model computation costs, and avoid model estimation distortion caused by multicollinearity among input parameters [17].
This research focuses on developing an accurate prediction model for heating/cooling loads in buildings with minimal user intervention and no specific infrastructure requirements. To achieve this, competitive learning is employed through the use of four distinct machine learning methods: multivariate polynomial regression, support vector regression, multilayer perceptron, and XGBoost. These algorithms span a wide range, from linear to nonlinear and from single models to ensemble methods, ensuring that the final model maintains high-level prediction on various datasets. This approach facilitates the adaptive reusability of the model across diverse building types and equipment configurations. The computational cost of model training and deployment is reduced by implementing sensitivity and correlation analyses to lower the dimensionality of the input parameters, making it possible to run the model on standard industrial computers. The research also incorporates and evaluates historical data of varying lengths in the training set, addressing the impact of time-varying building features on model accuracy, thereby reducing the size of the training set without diminishing prediction accuracy and lowering model training costs. The prediction model was deployed during the winter of 2021 to evaluate the influence of real-world factors like holiday staff movement, weather forecast accuracy, and building thermal inertia, providing valuable insight into the model’s practical performance.

2. Data Sources

In this research, the Nearly Zero-Energy Building of the China Academy of Building Research (CABR NZEB) is chosen as the testbed. This is a four-story office building with a total floor area of 4025 m2. The building was designed as the first nearly zero-energy building in a cold climate zone under the US-China Clean Energy Research Center (CERC) program. It is used as an experimental building, and its energy system is quite diverse and complex. A coupled solar energy and ground source heat pump system is the building’s main energy system. There are offices and meeting rooms on each floor, and different heating and cooling terminals are used for experimental purposes. The first and fourth floors use a water-source variable refrigerant volume (WS-VRV) system in summer and a radiator system in winter, while the second and third floors use floor and ceiling radiant systems, respectively. Detailed information on the building and its energy system can be found in [18,19].
The heating (cooling) load is obtained from the supply water temperature, the return water temperature, and the water flow rate. The load data from October 2018 to September 2021 were used to build the forecasting model, and the load data for 2020 are shown in Figure 1. As a nearly zero-energy building, the testbed has excellent passive regulation techniques; during transitional seasons such as April, May, and October, the HVAC system does not need to operate, leading to zero cooling/heating load data in these periods. Meteorological data and the energy consumption of electrical equipment were obtained through the building automation system, which enriched the dataset. In addition, weather forecast data were also used in this research. The forecast data contain many weather types, which are combined into 10 types: sunny, less cloudy, cloudy, overcast, light rain, moderate rain, heavy rain, snow, overcast, and other, and the types are coded as integers from 1 to 10. Table 1 shows the candidate parameters and their sources. The first seven items are the final selected input parameters of the model, and the last item is the model’s output parameter.

3. Modelling Methodology

In this section, a detailed roadmap of constructing the proposed load forecasting model will be provided, and the fundamental principles and methodologies utilized will be explicated.

3.1. Research Framework

The research framework is shown in Figure 2. The implementation of the prediction model consists of nine sub-processes, which can be divided into five steps: data preprocessing, input parameter selection, training set division, machine learning hyperparameter search, and multi-model competitive learning. Six different combinations of inputs are used to evaluate the role of sensitivity analysis and correlation analysis in input dimensionality reduction, and historical time-series data of different lengths are used as training sets to evaluate the impact of time-varying building characteristics on model accuracy. The forecasting model was deployed in the winter of 2021, and the impact of three uncertainty factors on model accuracy was evaluated. Each process is described in detail in the following sections.

3.2. Data Preprocessing

With the continuous development of building automation systems (BAS) and the Internet of Things (IoT), a large amount of building operation data has been accumulated, and these data hold considerable energy-saving potential. However, while a building automation system is running, some data become abnormal due to uncontrollable factors such as power failures, maintenance, and equipment damage. Therefore, data preprocessing is needed to identify and handle abnormal data and to fill in missing data. Common preprocessing methods include data transformation, data cleaning, and data normalization. Data transformation refers to converting data stored as characters into integers for model calculations. Data cleaning is the process of removing outliers or filling in missing data. Cluster analysis groups similar samples together and separates dissimilar ones into different categories, making it particularly well suited to screening abnormal data [20,21]. In this research, the k-means clustering method is used to screen out abnormal data. This method divides the data into k clusters, using the Euclidean distance to assess similarity. The calculation steps of the k-means clustering algorithm are as follows:
Step 1: k random samples are selected as initial cluster centers.
Step 2: Calculate the Euclidean distance between each sample and each cluster center. Each sample is assigned to the nearest cluster center, and the samples of the same cluster center are grouped into one class.
Step 3: The cluster centers are updated by calculating the average of each cluster.
Step 4: Repeat steps 2 and 3 until the cluster centers no longer change.
Abnormal data identified by cluster analysis can generally be divided into four types: inconsistent errors (e.g., the load value is zero while the heat pump is running), violation errors (e.g., the load value exceeds the maximum heat pump capacity), duplicate data, and missing data. All abnormal data are replaced with the mean of the values at the same hour on the previous and following days.
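As an illustration, the four steps above can be sketched in plain NumPy. This is an illustrative implementation, not the study's code; the cluster count k, iteration cap, and random seed are arbitrary choices here:

```python
import numpy as np

def kmeans(samples, k, n_iter=100, seed=0):
    """Plain k-means following Steps 1-4 above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, dtype=float)
    # Step 1: pick k random samples as the initial cluster centres
    centres = samples[rng.choice(len(samples), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each sample to its nearest centre (Euclidean distance)
        dist = np.linalg.norm(samples[:, None, :] - centres[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 3: move each centre to the mean of its cluster
        new_centres = centres.copy()
        for j in range(k):
            members = samples[labels == j]
            if len(members):                 # keep the old centre if a cluster empties
                new_centres[j] = members.mean(axis=0)
        # Step 4: stop once the centres no longer change
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres
```

Samples that end up in very small clusters, or far from their assigned centre, would then be flagged as abnormal and replaced as described above.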
When employing machine learning algorithms to construct a building load prediction model, normalization is indispensable. Its principal roles are: (1) Normalization adjusts the scale of each feature so that all features are on the same level, which accelerates the convergence of gradient descent and improves training efficiency. (2) Normalization somewhat ameliorates multicollinearity among features, mitigating the risk of overfitting and enhancing the model’s generalizability. (3) Certain models, such as SVR, hinge on distance computations within the data. Without normalization, marked discrepancies in the scales (dimensions) of the various features could cause the computational results to be dominated by features with larger magnitudes; normalization ensures all features operate on the same numeric scale, circumventing this issue. Data normalization here refers to transforming the raw data to values between 0 and 1:
$X_i' = \dfrac{X_i - X_{\min}}{X_{\max} - X_{\min}}$ (1)

$Y_i' = \dfrac{Y_i - Y_{\min}}{Y_{\max} - Y_{\min}}$ (2)

where $X_i$ and $Y_i$ are the input and output variables, respectively; $X_{\min}$, $X_{\max}$, $Y_{\min}$, and $Y_{\max}$ are the corresponding minimum and maximum values of $X_i$ and $Y_i$; and $X_i'$ and $Y_i'$ are the normalized input and output variables.
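Equations (1) and (2) are standard min-max scaling; a minimal sketch, including the inverse transform needed to convert normalized predictions back into physical load units:

```python
import numpy as np

def min_max_normalize(x):
    """Scale values to [0, 1] as in Equations (1)-(2)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min), x_min, x_max

def min_max_restore(x_norm, x_min, x_max):
    """Invert the scaling, e.g. to report predicted loads in physical units."""
    return np.asarray(x_norm) * (x_max - x_min) + x_min
```

Note that the training-set minima and maxima must be stored and reused when normalizing new data at deployment time.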

3.3. Feature Selection

Feature selection aims to identify the most effective subset of features from the myriad of accessible sensor parameters. This not only enhances the model’s precision but also mitigates overfitting and expedites the training process.
The cooling and heating loads of a building are caused by internal gains and external gains. This information is embedded in the time-series data obtained from the sensors, which comprise three types: historical energy consumption, outdoor conditions, and time labels. Buildings behave cyclically, so time-related parameters such as the hour of the day, the day of the week, and holidays can be used to predict changes in internal gains, while outdoor weather parameters can be used to predict changes in external gains. A reasonable choice of input parameters improves model interpretability, reduces training time, and lowers the possibility of overfitting, while still providing an accurate fit. This step includes (1) sensitivity analysis to calculate parameter contributions and (2) correlation analysis to find collinear parameters.
First, sensitivity analysis of the candidate parameters was carried out using the random forest (RF) algorithm. The RF algorithm is an ensemble method based on decision trees: it is composed of multiple CART trees and can be used for both classification and regression problems. The algorithm draws K bootstrap samples (with replacement) from the original sample set, obtaining K training sets, and trains one tree on each. For regression, the final output is the arithmetic mean of the predictions of all K trees (all trees are of equal importance).
The RF algorithm is a nonlinear ensemble learner, and its application in parameter sensitivity analysis is relatively mature [22]. In this research, the RF algorithm is used to compute and rank the contribution $V(X_j)$ of each parameter $X_j$ to assist the selection of input parameters.
$V(X_j) = \dfrac{1}{N}\sum_{t=1}^{N}\left(e_t^j - e_t\right)$ (3)

where $N$ is the number of decision trees, $e_t$ is the out-of-bag error of tree $t$, and $e_t^j$ is the new out-of-bag error produced after the parameter $X_j$ is randomly permuted. Because the RF algorithm samples with replacement, each decision tree leaves a portion of the training samples unsampled; the prediction error of the trees on these unsampled (out-of-bag) data is the out-of-bag error.
If the random change in the parameter X j will lead to a large increase in the out-of-bag error, it means that the parameter X j contributes greatly to the prediction accuracy of the model. Parameters that contribute more than 2% to the prediction accuracy of the model will be marked as key parameters.
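The paper's contribution measure is based on a random forest's out-of-bag error; the permutation idea behind Equation (3) can be sketched model-agnostically on a held-out set (an approximation of the OOB procedure — `predict` stands in for any fitted model, and normalizing the scores to ratios so they can be compared against the 2% threshold is an assumption):

```python
import numpy as np

def permutation_contribution(predict, X, y, seed=0):
    """Error increase after shuffling one column, normalized to ratios.

    Mirrors the spirit of Equation (3) on a held-out set rather than
    the out-of-bag samples of a random forest (an approximation).
    """
    rng = np.random.default_rng(seed)
    base_err = np.mean((predict(X) - y) ** 2)
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        rng.shuffle(Xp[:, j])                    # randomly permute feature j
        scores.append(np.mean((predict(Xp) - y) ** 2) - base_err)
    scores = np.asarray(scores)
    return scores / scores.sum()                 # assumes at least one feature matters
```

Features whose contribution ratio exceeds the chosen threshold would then be marked as key parameters.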
As mentioned earlier, merely selecting parameters that contribute highly to the predicted outcome is not necessarily a successful feature selection process. It is also necessary to calculate the pairwise correlations between the key parameters to find the collinear ones among them. The Pearson correlation coefficient is calculated by Equation (4):
$\rho_{X,Y} = \dfrac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}$ (4)

The Pearson correlation coefficient is defined as the covariance of two variables divided by the product of their standard deviations. The closer its absolute value is to 1, the stronger the linear correlation between the two variables. The interpretation of each value range is shown in Equation (5) [23].

$\begin{cases} \left|\rho_{X,Y}\right| = 1.0, & \text{completely linear correlation} \\ \left|\rho_{X,Y}\right| > 0.8, & \text{high correlation} \\ 0.5 < \left|\rho_{X,Y}\right| < 0.8, & \text{moderate correlation} \\ 0.3 < \left|\rho_{X,Y}\right| < 0.5, & \text{low correlation} \\ \left|\rho_{X,Y}\right| < 0.3, & \text{uncorrelated} \end{cases}$ (5)
The key parameters that are independent of each other will be directly used as input parameters, while the key parameters with collinearity need to be simplified according to the calculation results of the correlation analysis. Ultimately, the parameters that are independent of each other and have a significant impact on the prediction results will be selected as the final input of the model.
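A minimal sketch of this pruning rule (greedy: walk the key parameters in decreasing contribution order and drop any that correlate too strongly with one already kept; the 0.8 default follows the "high correlation" band in Equation (5), and any feature names passed in are for illustration only):

```python
import numpy as np

def prune_collinear(X, contributions, names, threshold=0.8):
    """Keep the higher-contribution member of every collinear pair."""
    corr = np.corrcoef(X, rowvar=False)        # pairwise Pearson coefficients
    order = np.argsort(contributions)[::-1]    # most important parameter first
    kept = []
    for j in order:
        if all(abs(corr[j, i]) <= threshold for i in kept):
            kept.append(j)
    return [names[j] for j in sorted(kept)]
```

The surviving features are mutually independent (below the threshold) and each is the strongest representative of its collinear group.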

3.4. Dataset Separation

This research chooses different ways to divide the dataset according to the number of samples. If the samples are sufficient, the training set and the test set can be divided 7:3, as in most of the literature. However, for some buildings, such as those only recently put into operation, it is difficult to obtain large datasets. In such cases, k-fold cross-validation is used to fully exploit the sample information when dividing the dataset and to prevent overfitting. k-fold cross-validation divides the dataset into k subsets; each time, k − 1 subsets are taken as the training set and one subset as the test set. This process is repeated k times until each of the k subsets has been used as test data exactly once.
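The split described above can be sketched as follows (a shuffled k-fold; whether the study shuffles or keeps folds chronological is not stated, so the shuffling here is an assumption — for strictly chronological load data a time-ordered split may be preferable):

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)             # k near-equal subsets
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx
```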

3.5. Machine Learning Algorithms

Different types of buildings have divergent characteristics and data sizes due to variations in geographical location, functional positioning, and financial conditions. To ensure the effectiveness of the modeling method in establishing a building load model across these diverse datasets, we chose four machine learning methods for parallel training and selected the best model through statistical indicators. Multivariate polynomial regression (MPR) serves as a critical tool in multivariate statistical analysis [24], capable of dealing with complex relationships between factors, like the building load influenced by multiple parameters such as solar radiation, meteorological conditions, and personnel distribution. Support vector regression (SVR) shows robustness to data boundaries and outlier values [25], proving advantageous when dealing with data missing and data imbalance. Multilayer perceptron (MLP) is effective in dealing with large-scale data and boasts the ability to automatically optimize its structure and weights using deep learning [26], particularly useful when handling considerable amounts of data and complex variable interaction relationships. Lastly, extreme gradient boosting (XGBoost), lauded for its flexibility and scalability, comes into play when dealing with large-scale data [27]. Its superb computational efficiency and model performance have earned it commendation amongst researchers. These four selected methods, covering categories from linear to nonlinear, from individual models to ensemble methods, are adaptable to different data features and application scenarios [28].
The adjustment of hyperparameters plays a decisive role in the accuracy and training efficiency of the model. Therefore, to eliminate the uncertainty of manual intervention, the best hyperparameters are selected automatically by algorithms. For the MPR model, regression models with highest-order terms ranging from 1 to 5 were established, and the one with the best accuracy was selected. The best network after multiple iterations (BNMI) algorithm was used to determine the optimal parameters of the MLP, including the number of hidden layers, the number of nodes, the link weights between nodes, and the bias values [26]. In addition, the grid search algorithm was used to optimize the hyperparameters of SVR and XGBoost [29]. The hyperparameter ranges for each model are detailed in Appendix A.
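Grid search itself is just an exhaustive sweep over the hyperparameter ranges; a stand-alone sketch of the idea (the study uses library implementations for SVR and XGBoost — the parameter names and the toy validation-error function in the usage below are illustrative only):

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every hyperparameter combination; return the lowest-score one."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for combo in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)               # e.g. cross-validated MSE
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In practice `score_fn` would train the model with the given hyperparameters and return its cross-validated error on the training folds.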
All models are implemented in Python 3.8; MPR and SVR use the scikit-learn library, XGBoost employs the XGBoost library, and MLP uses the TensorFlow library.

3.6. Evaluation Criteria

In order to evaluate the accuracy of the four modeling techniques, and then select the model most suitable for the dataset, this paper uses the predictive coefficient of determination (R2) and mean square error (MSE) as evaluation metrics.
The predictive coefficient of determination (R2) is a statistical metric used to measure the goodness of fit of the prediction model, indicating the extent to which the model captures the variability of the data. The value of R2 ranges from 0 to 1. When R2 equals 1, it signifies that the model perfectly fits the data, while an R2 of 0 indicates that the model fails to capture any variability in the data.
$R^2 = \dfrac{\sum_{i=1}^{n}\left(y_{\text{predicted},i} - \bar{y}_{\text{observed}}\right)^2}{\sum_{i=1}^{n}\left(y_{\text{observed},i} - \bar{y}_{\text{observed}}\right)^2}, \quad i = 1, 2, \ldots, n$ (6)

In the formula, $y_{\text{predicted}}$ represents the model’s predicted value, $y_{\text{observed}}$ is the actual value, and $\bar{y}_{\text{observed}}$ is the average of the actual values. The numerator is the explained sum of squares (ESS), the sum of squared deviations of the predictions from the true mean, which can be understood as the amount of data variability the model explains. The denominator is the total sum of squares (TSS), the sum of squared deviations of each data point from the true mean, indicating the total variability in the data.
Mean square error (MSE) is a common performance metric for regression models that quantifies the average square difference between the predicted and actual values. A smaller MSE signifies better prediction performance of the model, indicating lower errors.
$MSE = \dfrac{1}{n}\sum_{i=1}^{n}\left(y_{\text{observed},i} - y_{\text{predicted},i}\right)^2 \in [0, +\infty), \quad i = 1, 2, \ldots, n$ (7)
In the formula, n represents the total number of samples.
Although R2 and MSE are sufficient to evaluate the accuracy of a model’s predictions, the root mean square error (RMSE) and mean absolute error (MAE) can provide additional perspectives for assessing and understanding model performance.
The root mean square error (RMSE) is the square root of the mean square error (MSE). Given that it is in the same unit as the actual values, it provides a more intuitive perspective for the interpretation and reporting of model results, facilitating a clearer explanation of the model’s predictive performance.
$RMSE = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}\left(y_{\text{observed},i} - y_{\text{predicted},i}\right)^2} \in [0, +\infty), \quad i = 1, 2, \ldots, n$ (8)
The Mean Absolute Error (MAE) is the average of the absolute differences between the individual observed values and the predicted ones. Compared to RMSE, MAE is less sensitive to extreme errors, and it better represents the average level of predictive errors of the model. This is particularly meaningful in scenarios that require a focus on average prediction errors or where there is a high demand for prediction result stability.
$MAE = \dfrac{1}{n}\sum_{i=1}^{n}\left|y_{\text{observed},i} - y_{\text{predicted},i}\right| \in [0, +\infty), \quad i = 1, 2, \ldots, n$ (9)
The coefficient of variation of the root mean square error (CV(RMSE)) is computed by dividing the RMSE by the average of the measured values. The CV(RMSE) is the proportion of RMSE relative to the average measured values, expressed as a percentage. Given its normalized nature, this value is not affected by the scale of the data and can serve as a tool for comparing different datasets of the same type. According to ASHRAE Guideline 14, when conducting hourly calculations, a model’s error (CV(RMSE)) should be within ±30% [30].
$CV(RMSE) = RMSE / \bar{y}_{\text{observed}}$ (10)
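The five criteria can be computed in a few lines; the sketch below follows Equation (6) literally (ESS/TSS), which coincides with the more common 1 − SSE/TSS form only under least-squares fitting conditions:

```python
import numpy as np

def evaluate(y_obs, y_pred):
    """Compute R2 (as ESS/TSS per Eq. (6)), MSE, RMSE, MAE and CV(RMSE)."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ess = np.sum((y_pred - y_obs.mean()) ** 2)   # explained sum of squares
    tss = np.sum((y_obs - y_obs.mean()) ** 2)    # total sum of squares
    mse = float(np.mean((y_obs - y_pred) ** 2))
    rmse = mse ** 0.5
    mae = float(np.mean(np.abs(y_obs - y_pred)))
    return {"R2": ess / tss, "MSE": mse, "RMSE": rmse,
            "MAE": mae, "CV(RMSE)": rmse / y_obs.mean()}
```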

4. Results and Discussion

This section presents the research findings and discusses them in depth. First, the impact of feature selection, regarded as one of the key steps affecting model performance, is discussed. Next, the effect of training set size on model accuracy is investigated, along with the influence of uncertainty factors during actual deployment on the model’s precision. Lastly, the limitations of the research are identified and directions for future work are proposed.

4.1. Influence of Feature Selection

The sensitivity analysis results for the candidate parameters in Table 1 are shown in Figure 3. A series of parameters has been considered for the load prediction of the nearly zero-energy building; owing to its distinctive insulation and airtightness, the effect of these parameters on prediction differs from that in conventional buildings.
Among these parameters, those with an impact ratio exceeding 2% have been identified as key parameters for further correlation analysis. Heat pump power consumption has the primary influence on the prediction: as the key appliance for adjusting building loads, its operating conditions directly affect the load. Parameters such as the day of the week, solar radiation, weather type, holiday status, and lighting energy consumption reflect the effects of personnel dynamics, sunlight conditions, climate factors, usage patterns, and building usage status on the load. Similarly, parameters such as hour, temperature, pressure, month, wind speed, and year indirectly capture the influence of time progression, the ambient thermal environment, airflow, and seasonal changes. In addition, among the parameters with an impact of around 2% are day, humidity, and elevator energy consumption; these mainly reflect the building’s operational status, indoor humidity control, and the influence of building usage status on the load.
The contributions of rainfall, the power consumption of other equipment, and wind direction are all less than 2%, so they are excluded from the candidate parameters and no correlation calculation is performed for them. This is because the power draw of the other equipment is relatively small, and the outstanding insulation and airtightness of the building mean that rainfall and wind direction have only a slight effect on its interior temperature and humidity, and thereby a relatively limited impact on the load.
Figure 4 presents a correlation matrix, a crucial tool for understanding the relationships between input and output variables, as well as pairwise relationships among the inputs. After a correlation analysis of key parameters, and referring to the results of the sensitivity analysis for parameter contribution, we adopted the following strategy to optimize model performance and reduce complexity. First, there is a high correlation among the four parameters of heat pump electricity, elevator electricity, lighting electricity, and solar irradiance. Therefore, we only retain heat pump electricity, which has the highest contribution in sensitivity analysis. Second, considering the correlation of relative humidity with weather conditions and solar radiation, and the significant correlation between year and month, temperature, and air pressure, we exclude relative humidity and year, both of lower contribution. Among the three highly correlated parameters, weather, temperature, and air pressure, we choose to keep weather as an input variable, given its largest contribution to the model. Since the holiday variable already encompasses much of the weekend information, the weekend is excluded from the input parameters.
Out of the 18 candidate parameters, 7 are retained as the final input parameters, as listed in Table 2. Owing to the periodicity of building behavior and of meteorological conditions, time-related parameters can simultaneously reflect changes in the building’s internal and external gains. The heat pump power consumption at the same time on the previous day contains information such as the working mode and load level of the building’s HVAC system. Weather type captures changes in outdoor weather parameters, which reflect the building’s external gain. Within the framework of this study, digits are used to denote the weather types: from ‘1’ to ‘10’, they stand for sunny, less cloudy, cloudy, overcast, light rain, moderate rain, heavy rain, snow, overcast, and other. Changes in wind speed affect the convective heat transfer coefficient, which in turn affects heat loss through the building’s external envelope; wind speed also affects the pressure differential of natural or mechanical ventilation systems, which governs the entry of outdoor air into the building and the air exchange rate. Given that the object of study is an office building, whether the current day is a holiday or a workday factors into personnel mobility; here, ‘0’ denotes a regular working day and ‘1’ a holiday. The selected input parameters conform to physical intuition.
Table 3 lists six input parameter combinations, named “I0” to “I5”, used to evaluate the impact of feature selection on model performance. Figure 5 shows the variation in R2 and MSE of the best model among MPR, MLP, SVR, and XGBoost under the different input selections. The model built with input parameter set I0 has the best performance in terms of both R2 and MSE. This shows that the input parameter selection method used in this paper achieves its intended purpose and can improve model accuracy while reducing the amount of computation. The MSE of the model established with input set I2 is only 0.0006 higher than that of the model established with input set I1, but its R2 is improved by 0.014. This corroborates the random forest algorithm’s calculation of the contributions of the candidate parameters.
The R2 of the models built with input combinations I4 and I5 are 0.91 and 0.89, respectively, so these models retain a certain reliability. However, both MSEs are large, meaning the models incur large errors at certain moments. The comparison of I1, I4, and I5 shows that both meteorological information and historical energy consumption information are indispensable inputs to the load forecasting model; omitting either leads to a substantial loss of forecasting accuracy. Surprisingly, the model built with I3 has the worst prediction accuracy. This is because the Pearson correlation coefficient cannot capture the nonlinear relationships between the candidate parameters and the target parameter, leaving the model with insufficient input information.
In addition, the outcome of the competition among the four machine learning techniques varies with the input parameters. XGBoost is stable and highly accurate across the input combinations. When there are more input parameters, as in I2, MLP better approximates the nonlinear process. SVR and MPR achieve the best prediction accuracy when there are few input parameters, as in I3.
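The two-step selection described in this section (random-forest importance ranking followed by Pearson-correlation pruning of redundant inputs) can be sketched as follows. This is a minimal illustration with synthetic data, not the paper's implementation; the function name and the 0.9 correlation threshold are assumptions:

```python
# Sketch: rank candidates by random-forest importance, then prune
# candidates that are strongly Pearson-correlated with an already-kept,
# more important feature (to avoid multicollinearity).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def select_features(X: pd.DataFrame, y: pd.Series,
                    corr_limit: float = 0.9) -> list:
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    rf.fit(X, y)
    ranked = sorted(zip(X.columns, rf.feature_importances_),
                    key=lambda p: p[1], reverse=True)
    kept = []
    for name, _ in ranked:
        # Keep only if not highly correlated with any retained feature.
        if all(abs(X[name].corr(X[k])) < corr_limit for k in kept):
            kept.append(name)
    return kept

rng = np.random.default_rng(0)
a = rng.normal(size=500)
X = pd.DataFrame({"a": a,
                  "a_copy": a + 1e-6 * rng.normal(size=500),
                  "b": rng.normal(size=500)})
y = 2 * X["a"] + X["b"]
print(select_features(X, y))  # the near-duplicate of "a" is pruned
```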

4.2. Influence of the Training Sets Size

This section examines how prediction accuracy changes when the training set size varies under fixed input conditions. On the one hand, buildings and their energy systems are time-varying: their energy consumption and load profiles change over time. On the other hand, the computational cost grows with the number of samples in the training set. Whether to model with all available sensor data or only with recent data is therefore a question worth discussing. Taking heat load prediction as the focal point, this study investigates the effect of training set size on prediction accuracy.
Based on earlier results, the selected input parameters are the month of the year, the day of the month, the hour of the day, whether or not it is a holiday, the power consumption of the heat pump at the same time on the previous day, the weather, and the wind speed. The entire sample comprises three years’ worth of historical data from October 2018 to September 2021.
With January 2021 as the test set, the training set is divided in the following manner: Initially, data from December 2020 is used as the training set. Subsequently, data from November 2020 are added, followed by the addition of data from October 2020. This pattern of gradually adding data, month by month, continues until the training set includes all available data.
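The expanding training-window scheme described above can be sketched as follows. The helper name and data layout are illustrative; the scheme fixes January 2021 as the test month and grows the training set backwards one month at a time from December 2020 to October 2018:

```python
# Sketch of the expanding training-set scheme: January 2021 is the
# fixed test month; training data grow backwards month by month.
import pandas as pd

def expanding_windows(df, test_month="2021-01", start="2018-10"):
    """Yield (train, test) pairs; df must have a DatetimeIndex."""
    test = df.loc[test_month]
    end = pd.Period(test_month) - 1          # December 2020
    n_months = (end - pd.Period(start)).n + 1
    for k in range(1, n_months + 1):
        lo = (end - (k - 1)).to_timestamp()          # window start
        hi = end.to_timestamp(how="end")             # end of Dec 2020
        yield df.loc[lo:hi], test

idx = pd.date_range("2018-10-01", "2021-01-31 23:00", freq="H")
df = pd.DataFrame({"load": range(len(idx))}, index=idx)
windows = list(expanding_windows(df))
print(len(windows))  # 27 monthly expansions (Oct 2018 .. Dec 2020)
```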
As observable from Figure 6, the impact that the size of the training set has on different modeling techniques is not uniform. In line with the conclusions derived from theoretical analysis, there does not exist a straightforward linear relationship between the size of the training set and model accuracy. The prediction accuracy of the four models initially increases rapidly with the enrichment of the training set, reaching a peak upon the addition of data from the 3rd month. Subsequently, and until the inclusion of data from the 12th month, the prediction accuracy of all four models fluctuates with the enlargement of the training set. However, while the overall accuracy of MLP exhibits an upward trend, that of MPR tends to decrease. The prediction accuracies of the remaining two models, SVR and XGBoost, fluctuate less and remain relatively steady.
The initial three months of data added to the training set belong to the same heating season as the prediction target, explaining the swift rise in the model’s prediction accuracy during this period. The prediction accuracy fluctuates because the subsequently added data do not fall under the same season as the prediction target. During this period, the patterns of building energy use, building occupants, outdoor weather parameters, and others differ from those of the prediction target, leading to a greater influx of noise data.
Contrary to the other models, the accuracy of MLP maintains an upward trajectory even after data from differing seasonal periods are added. This can be attributed to MLP being a deep neural network with multiple hidden layers (2–10, automatically tuned according to the training samples). These hidden layers automatically construct latent features from complex data for which manual feature extraction proves challenging. As a result, even training samples from different seasons help MLP improve its structural resilience.
All models gain a substantial improvement in prediction accuracy after the data from the twelfth month (January of the previous year) are added. This is because that portion of the training set contains the same meteorological conditions and energy use patterns as the prediction target, so the prediction model can learn a more comprehensive mapping relationship, which helps maintain accuracy when the inputs change.
As additional samples continue to be added to the training set, the prediction accuracy of the model no longer improves, and in some cases it even exhibits a slight decrease. This reduction in accuracy can be attributed to the time-variant characteristics of the building and its energy system. For instance, the building's load increases year on year due to factors such as a rise in employee numbers, so older data contain a substantial amount of invalid information that can disturb the model. This pattern continues until the addition of data from the 24th month, after which the model's accuracy again follows a pattern of improvement followed by decrease, driven by similar reasons.
The highest model accuracy is achieved when the size of the training set is 24 months, with the optimal model being MLP, demonstrating an R2 of 0.92 and MSE of 3. The subsequent highest accuracy for the training set size is 12 months, and the optimal model is XGBoost, resulting in an R2 of 0.92 and MSE of 3.2. The difference between these two is not substantial.
Furthermore, Figure 6 also indicates that the modeling techniques yielding optimum performance at various sizes of training sets differ. When only 1–2 months of historical data are available, MPR or SVR render the highest prediction accuracy. XGBoost performs optimally when 3–11 months of data are accessible for modeling. Both XGBoost and MLP are likely to manifest the highest prediction accuracy when the historical data used for modeling exceed 12 months.
Figure 7 demonstrates the impact of the training set size on the time cost of competitive learning across the four modeling techniques. The hardware environment used is an ASUS ROG Zephyrus M GU502 (sourced from ASUS, Taiwan, China) laptop equipped with an Intel® Core™ i7-8750H 3.5 GHz CPU, an RTX 2070 GPU with 6 GB of GDDR6 memory, and 32.00 GB of RAM. It can be noted that the time cost of training exhibits a linear relationship with the size of the training set.
Modeling with 24 months of historical data, compared to that with 12 months of historical data, leads to a slightly higher forecast accuracy. However, the associated computational cost escalates by almost 73%. Excessive data can lead to information redundancy. Therefore, the utilization of the most recent year of historical data allows for minimization of computational costs while preserving accuracy.
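The roughly linear scaling of training time with sample count can be checked with a simple timing sketch. The data here are synthetic, and scikit-learn's `GradientBoostingRegressor` stands in for XGBoost to keep the example dependency-free; absolute times depend on hardware, but the trend is what matters:

```python
# Illustrative timing of model fitting versus training-set size.
import time
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
timings = []
for n in (720, 2160, 4320, 8640):      # ~1, 3, 6, 12 months of hourly data
    X = rng.normal(size=(n, 7))        # 7 inputs, as in Table 2
    y = X @ rng.normal(size=7) + rng.normal(size=n)
    t0 = time.perf_counter()
    GradientBoostingRegressor(n_estimators=100).fit(X, y)
    timings.append((n, time.perf_counter() - t0))

for n, t in timings:
    print(f"{n:5d} samples: {t:.2f} s")
```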

4.3. Actual Deployment and Influence of Uncertainty Factors

The prediction results in the previous sections are based on actual sensor data, a common practice in prior studies. Nonetheless, practical forecasting necessitates the use of weather forecast data rather than measured meteorological data—a factor that previous studies have somewhat overlooked in relation to the impact of uncertainties in weather forecast data on model predictions. Moreover, actual operation of the building is subject to uncertainties such as unforeseen holidays and overtime work.
For this study, four individual machine learning prediction models were built, following the preceding analysis, using data from 2020 with the parameters listed in Table 2 as inputs. The load forecasting program was deployed from 24 November 2021 to 13 February 2022 to investigate the impacts of various uncertainties on the model's forecasting performance.
Weather data, as a critical input element, plays an essential role in constructing a building load forecasting model. Significant uncertainties exist between the use of actual historical weather data in the test set and forecast weather data in the deployment set. As shown in Table 4, four machine learning algorithms—XGBoost, MLP, SVR, and MPR—can all meet the requirements of CV(RMSE) for model error specified by ASHRAE when actually deployed. However, these four models display different characteristics.
XGBoost presents the largest loss in R2 and a considerable increase in MSE between the test set and the deployment set, indicating a high sensitivity to weather forecast uncertainties. Despite this, XGBoost retains an excellent R2 and a relatively good MSE.
The behavior of MLP and SVR is similar to that of XGBoost but more moderate in terms of R2 loss and MSE increase, demonstrating their relative insensitivity to weather prediction uncertainties. MLP, as an artificial neural network model, mitigates the impact of weather forecasting inaccuracies through its strong nonlinear fitting ability. Similarly, SVR, with a consistent R2 and MSE, may owe its performance to its strong resistance to interference and its ability to handle outliers.
MPR, however, performs the worst among all models, with the largest increase in MSE. As a polynomial regression model that is linear in its parameters, its ability to fit nonlinear and variable weather forecast data is relatively weak, so weather prediction uncertainties have a substantial impact on it.
Considering the performance of these four algorithms, the ranking is as follows: XGBoost > MLP > SVR > MPR. Potential reasons may be related to the characteristics of the algorithms and the characteristics of the data. As an ensemble learning method based on gradient boosting, XGBoost captures complex patterns in the data due to its strong model complexity and fitting ability. MLP, as a neural network model, often requires a large amount of training data and is sensitive to the choice of parameters and network structure. SVR shows high accuracy and is friendly to small sample training but requires a suitable kernel function when dealing with nonlinear problems and multidimensional data. In contrast, MPR, as a linear model, handles linear problems well but performs poorly with nonlinear and high-dimensional data.
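The three metrics used in this comparison can be computed directly. This is a minimal sketch with illustrative numbers; CV(RMSE) follows the usual definition of RMSE divided by the mean of the measured values:

```python
# Minimal computation of the three reported metrics: R2, MSE, CV(RMSE).
import numpy as np

def r2(y, yhat):
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mse(y, yhat):
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.mean((y - yhat) ** 2))

def cv_rmse(y, yhat):
    # RMSE normalized by the mean of the measurements.
    return float(np.sqrt(mse(y, yhat)) / np.mean(np.asarray(y, float)))

measured  = [10.0, 12.0, 15.0, 20.0, 18.0]   # illustrative loads, kW
predicted = [11.0, 11.5, 14.0, 21.0, 17.0]
print(round(r2(measured, predicted), 3),
      round(cv_rmse(measured, predicted) * 100, 1), "%")  # 0.938 6.1 %
```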
Figure 8 and Figure 9 illustrate the specific performance of the XGBoost model during its deployment, further underscoring how the model’s accuracy can be influenced by these uncertainties.
As the input parameters do not incorporate the building's own characteristics, such as insulation properties, the thermal inertia of the building can sometimes act as an uncertainty factor in the prediction results. The prediction target achieves the nearly zero-energy building energy efficiency level: the envelope is highly insulated, and the thermal storage capacity of the building and of the facilities within it (i.e., floors, ceilings, walls, and furniture) can be fully leveraged. During the first week of heating, the building itself is progressively heated; influenced by the building's thermal inertia, the predicted value is initially high and later low, before gradually stabilizing at the normal level. Accurate building load forecasts may likewise prove challenging to attain on the first day back at work following a holiday. However, because building thermal inertia only impacts prediction accuracy in these specific cases, it may not be cost-effective to utilize building information as a model input. One potential solution is to build separate predictive models using data from these particular periods and to switch between the two models depending on the time.
Uncertainty in weather forecasting can also impact the accuracy of the model’s predictions. For instance, from 12 December 2021, to 24 December 2021, the weather forecast predicted sunny weather, but the actual conditions were hazy. As a result, many of the sample points in Figure 8 exhibit an error of more than +30% (the predicted load is substantially greater than the measured load). One potential solution could be to add both weather forecast data and measured meteorological data to the candidate parameters. Exposing the model to uncertain weather data during the training stage enables the model to become aware of the data uncertainty, gradually discovering ways to handle this uncertainty during the training process. Currently, it is challenging to retrospectively obtain weather forecast data, so it is necessary to collect and store weather forecast data for future building load models.
The final uncertainty factor that significantly influences model accuracy is personnel movement during holidays. The prediction target is an office building, which is theoretically unoccupied during weekends and holidays. In practice, however, employees may still visit the office for work purposes, events, or personal matters. Consequently, the building's occupancy rate and equipment usage during weekends and holidays become unpredictable and subject to a high degree of uncertainty, and this variability substantially affects the prediction accuracy of the model. In Figure 9, the forecasted load on weekends is stable at 5–10 kW, while the measured value fluctuates between 5 and 20 kW with the movement of people.

4.4. Limitations and Future Directions

The building load prediction method we propose, which competes through four types of machine learning techniques, has the following advantages:
(1)
Simplicity. Users can develop a high-accuracy data-driven building load prediction model without detailed knowledge of data-driven modeling techniques or of the building's operational data, which saves a great deal of time.
(2)
High accuracy. By automatically optimizing the model training process, the method we propose can achieve higher prediction accuracy.
(3)
High applicability. The machine learning methods used cover categories from linear to nonlinear, from single models to ensemble methods, suitable for different data features and application scenarios.
Nonetheless, there are still a few limitations present in this study that prompt further detailed research:
(1)
Although this study examines the three most significant uncertainty factors that impact the prediction accuracy during actual deployment, these issues have yet to be fundamentally resolved.
(2)
This paper employs statistical indices to select the final model. However, these indices do not directly reflect the sensitivity of the model to the uncertainty of the input data during the actual deployment, especially when this uncertainty comes from uncontrollable external environmental factors, such as weather forecasts.
(3)
The amalgamation of physical mechanisms and data-driven methods might efficiently enhance the accuracy of load prediction during actual deployment and provide a more robust interpretation of prediction models.

5. Conclusions

Building load prediction holds importance for HVAC control, thermal storage operation, and smart grid management. This study applied sensitivity and correlation analysis to select input parameters. Four model identification methods, namely multivariate polynomial regression (MPR), support vector regression (SVR), multilayer perceptron (MLP), and extreme gradient boosting (XGBoost), were used. The purpose was to adapt to different datasets and minimize human requirements in the modeling process.
This study assessed the impact of input parameter selection and training set size on model accuracy. The proposed approach was implemented in a nearly zero-energy office building. The impact of three common uncertainties was analyzed based on model deployment results. The primary findings include
(1)
The multicollinearity problem among model input parameters was addressed through the Pearson correlation coefficient and the random forest algorithm, which enabled correlation and sensitivity analysis of each candidate parameter. The proposed input parameter selection method improved prediction accuracy while reducing the number of input parameters by about 60%.
(2)
This study observed a linear relationship between training set size and model training time. However, a larger training set did not always enhance model prediction performance. An overly large training set could include invalid information, affecting model accuracy. The most recent year’s historical data proved sufficient for predictive accuracy.
(3)
This research explored modeling techniques that accommodate load differences among buildings, which may arise from factors such as latitude, longitude, and climate. Independent training and optimization on each building's dataset are advisable. Generally, SVR and MPR showed higher prediction accuracy with fewer samples, while XGBoost and MLP performed better when the sample size was sufficient.
(4)
During deployment, model performance declined due to various uncertainties. The most significant impacts on prediction accuracy were weather forecast accuracy and personnel movements during holidays. Building thermal inertia also affected model prediction accuracy, notably during the initial heating week.

Author Contributions

Conceptualization, Z.Y.; data curation, H.L.; formal analysis, H.L.; methodology, Z.L. and Z.Y.; software, H.Q.; writing—original draft, H.Q.; writing—review and editing, Z.L. and Y.Z.; investigation, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data available on request due to restrictions.

Conflicts of Interest

The authors declare no conflict of interest.

Nomenclature

Xi: Input variables (-)
Yi: Output variables (-)
Xmin: The relevant minimum of Xi (-)
Xmax: The relevant maximum of Xi (-)
Ymin: The relevant minimum of Yi (-)
Ymax: The relevant maximum of Yi (-)
Xi′: The normalized input (-)
Yi′: The normalized output (-)
ρX,Y: Pearson correlation coefficient (-)
cov(X,Y): Covariance (-)
σX: Standard deviation (-)
n: Sample size (-)
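The normalized variables and the Pearson coefficient listed above follow the standard min-max scaling and correlation definitions (sketched here for reference; the prime notation for the normalized variables is illustrative):

```latex
X_i' = \frac{X_i - X_{\min}}{X_{\max} - X_{\min}}, \qquad
Y_i' = \frac{Y_i - Y_{\min}}{Y_{\max} - Y_{\min}}, \qquad
\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
```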

Abbreviations

HVAC: Heating, ventilation and air conditioning
NZE: Net zero emissions
NZEB: Nearly zero-energy building
MPC: Model predictive control
SVM: Support vector machine
SVR: Support vector regression
ANN: Artificial neural network
MLP: Multilayer perceptron
MPR: Multivariate polynomial regression
XGBoost: Extreme gradient boosting
MSE: Mean square error
MAE: Mean absolute error
RMSE: Root mean squared error
R2: Coefficient of determination
ESS: Explained sum of squares
TSS: Total sum of squares
CABR: China Academy of Building Research
RBFs: Radial basis functions
CERC: Clean Energy Research Center
ASHP: Air source heat pump
PAU: Primary air unit
BNMI: Best network after multiple iterations
BAS: Building automation system
RF: Random forest algorithm
CART: Classification and regression tree
IoT: Internet of things
GPU: Graphics processing unit
RAM: Random access memory
WS-VRV: Water source variable refrigerant volume system

Appendix A. Hyperparameter Adjustment Range

Table A1. Grid search hyperparameters of SVR.
Hyperparameter | Description | Grid Range
Kernel | Mapping method for higher-dimensional spaces | RBF
C | Penalty coefficient; the larger C, the less smooth the decision boundary | [0.1, 0.5, 1, 2, 3, 4, 5, 10, 50, 100, 500]
gamma | Kernel coefficient; the larger the value, the more complex the model | [0.1, 10]
Table A2. Grid search hyperparameters of XGBoost.
Hyperparameter | Description | Grid Range
n_estimators | Number of decision trees in the model | [5, 10, 50, 75, 100, …, 500]
max_depth | Maximum depth of each decision tree | [1, 2, 3, 4, 5, 10, 50, 100]
learning_rate | Weight reduction factor for each decision tree | [0.001, 0.01, 0.1, 0.2, 0.3, 0.5]
min_child_weight | Minimum child node weight threshold | 1
gamma | Loss reduction threshold for decision tree splitting | [0, 0.1, 0.5, 1]
Table A3. BNMI search hyperparameters of MLP.
Hyperparameter | Description | Grid Range
Number of hidden layers | Each hidden layer is a fully connected layer | [2, 10]
Number of neurons | Number of neurons per hidden layer | [5, 10, 15, 20, …, 300]
Table A4. Grid search hyperparameters of MPR.
Hyperparameter | Description | Grid Range
Highest degree | The highest degree of terms in the polynomial | [1, 2, 3, 4, 5]
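A grid such as Table A1 can be wired into scikit-learn's `GridSearchCV`; the sketch below uses synthetic data, and the two-value gamma list is one possible reading of the "[0.1, 10]" range in the table (the original search may have used a finer sweep):

```python
# Sketch: running the Table A1 SVR grid with scikit-learn's GridSearchCV.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {
    "kernel": ["rbf"],
    "C": [0.1, 0.5, 1, 2, 3, 4, 5, 10, 50, 100, 500],
    "gamma": [0.1, 10],   # interpreted as two candidates
}

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))          # 7 inputs, as in Table 2
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

search = GridSearchCV(SVR(), param_grid, cv=3,
                      scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_["kernel"])   # rbf
```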

References

  1. Yuan, T.; Ding, Y.; Zhang, Q.; Zhu, N.; Yang, K.; He, Q. Thermodynamic and economic analysis for ground-source heat pump system coupled with borehole free cooling. Energy Build. 2017, 155, 185–197. [Google Scholar] [CrossRef]
  2. Attia, S.; Eleftheriou, P.; Xeni, F.; Morlot, R.; Ménézo, C.; Kostopoulos, V.; Betsi, M.; Kalaitzoglou, I.; Pagliano, L.; Cellura, M.; et al. Overview and future challenges of nearly zero energy buildings (nZEB) design in Southern Europe. Energy Build. 2017, 155, 439–458. [Google Scholar] [CrossRef]
  3. He, N.; Qian, C.; Liu, L.; Cheng, F. Air conditioning load prediction based on hybrid data decomposition and non-parametric fusion model. J. Build. Eng. 2023, 80, 108095. [Google Scholar] [CrossRef]
  4. Jouhara, H.; Yang, J. Energy efficient HVAC systems. Energy Build. 2018, 179, 83–85. [Google Scholar] [CrossRef]
  5. Homod, R.Z. Analysis and optimization of HVAC control systems based on energy and performance considerations for smart buildings. Renew. Energy 2018, 126, 49–64. [Google Scholar] [CrossRef]
  6. Ma, Z.; Wang, S. Supervisory and optimal control of central chiller plants using simplified adaptive models and genetic algorithm. Appl. Energy 2010, 88, 198–211. [Google Scholar] [CrossRef]
  7. Wang, S.; Xu, X. Simplified building model for transient thermal performances estimation using GA-based parameter identification. Int. J. Therm. Sci. 2006, 45, 419–432. [Google Scholar] [CrossRef]
  8. Roy, S.S.; Samui, P.; Nagtode, I.; Jain, H.; Shivaramakrishnan, V.; Mohammadi-Ivatloo, B. Forecasting heating and cooling loads of buildings: A comparative performance analysis. J. Ambient. Intell. Humaniz. Comput. 2020, 11, 1253–1264. [Google Scholar] [CrossRef]
  9. O’Neill, Z.; Narayanan, S.; Brahme, R. Model-based Thermal Load Estimation in Buildings. In Proceedings of the Fourth National Conference of IBPSA-USA, New York, NY, USA, 11–13 August 2010. [Google Scholar]
  10. Rockett, P.; Hathway, E.A. Model-predictive control for non-domestic buildings: A critical review and prospects. Build. Res. Inf. 2017, 45, 556–571. [Google Scholar] [CrossRef]
  11. Jallal, M.A.; Gonzalez-Vidal, A.; Skarmeta, A.F.; Chabaa, S.; Zeroual, A. A hybrid neuro-fuzzy inference system-based algorithm for time series forecasting applied to energy consumption prediction. Appl. Energy 2020, 268, 114977. [Google Scholar] [CrossRef]
  12. Guo, Q.; Tian, Z.; Ding, Y.; Zhu, N. An improved office building cooling load prediction model based on multivariable linear regression. Energy Build. 2015, 107, 445–455. [Google Scholar]
  13. Huang, Y.; Li, C. Accurate heating, ventilation and air conditioning system load prediction for residential buildings using improved ant colony optimization and wavelet neural network. J. Build. Eng. 2021, 35, 101972. [Google Scholar] [CrossRef]
  14. Liang, R.; Le-Hung, T.; Nguyen-Thoi, T. Energy consumption prediction of air-conditioning systems in eco-buildings using hunger games search optimization-based artificial neural network model. J. Build. Eng. 2022, 59, 105087. [Google Scholar] [CrossRef]
  15. Tao, Y.; Yan, H.; Gao, H.; Sun, Y.; Li, G. Application of SVR optimized by modified simulated annealing (MSA-SVR) air conditioning load prediction model. J. Ind. Inf. Integr. 2019, 15, 247–251. [Google Scholar] [CrossRef]
  16. Fan, C.; Yan, D.; Xiao, F.; Li, A.; An, J.; Kang, X. Advanced data analytics for enhancing building performances: From data-driven to big data-driven approaches. In Building Simulation; Tsinghua University Press: Beijing, China, 2021; Volume 14, pp. 3–24. [Google Scholar]
  17. Zhang, C.; Li, J.; Zhao, Y.; Li, T.; Chen, Q.; Zhang, X. A hybrid deep learning-based method for short-term building energy load prediction combined with an interpretation process. Energy Build. 2020, 225, 110301. [Google Scholar] [CrossRef]
  18. Li, H.; Xu, W.; Yu, Z.; Wu, J.; Sun, Z. Application analyze of a ground source heat pump system in a nearly zero energy building in China. Energy 2017, 125, 140–151. [Google Scholar] [CrossRef]
  19. Li, H.; Zhang, S.; Yu, Z.; Wu, J.; Li, B. Cooling operation analysis of multienergy systems in a nearly zero energy building. Energy Build. 2021, 234, 110683. [Google Scholar] [CrossRef]
  20. Jing, J.; Ke, S.; Li, T.; Wang, T. Energy method of geophysical logging lithology based on K-means dynamic clustering analysis. Environ. Technol. Innov. 2021, 23, 101534. [Google Scholar] [CrossRef]
  21. Hong, Y.; Ezeh, C.I.; Deng, W.; Hong, S.H.; Peng, Z.; Tang, Y. Correlation between building characteristics and associated energy consumption: Prototyping low-rise office buildings in Shanghai. Energy Build. 2020, 217, 109959. [Google Scholar] [CrossRef]
  22. Mellit, A.; Kalogirou, S.A. Artificial intelligence techniques for photovoltaic applications: A review. Prog. Energy Combust. Sci. 2008, 34, 574–632. [Google Scholar] [CrossRef]
  23. Guo, Y.; Wang, J.; Chen, H.; Li, G.; Liu, J.; Xu, C.; Huang, R.; Huang, Y. Machine learning-based thermal response time ahead energy demand prediction for building heating systems. Appl. Energy 2018, 221, 16–27. [Google Scholar] [CrossRef]
  24. Tsai, C.L.; Chen, W.T.; Chang, C.S. Polynomial-Fourier series model for analyzing and predicting electricity consumption in buildings. Energy Build. 2016, 127, 301–312. [Google Scholar] [CrossRef]
  25. Liu, Y.; Chen, H.; Zhang, L.; Wu, X.; Wang, X.J. Energy consumption prediction and diagnosis of public buildings based on support vector machine learning: A case study in China. J. Clean. Prod. 2020, 272, 122542. [Google Scholar] [CrossRef]
  26. Afram, A.; Janabi-Sharifi, F.; Fung, A.S.; Raahemifar, K. Artificial neural network (ANN) based model predictive control (MPC) and optimization of HVAC systems: A state of the art review and case study of a residential HVAC system. Energy Build. 2017, 141, 96–113. [Google Scholar] [CrossRef]
  27. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  28. Bagherzadeh, F.; Shafighfard, T.; Khan, R.M.A.; Szczuko, P.; Mieloszyk, M. Prediction of maximum tensile stress in plain-weave composite laminates with interacting holes via stacked machine learning algorithms: A comparative study. Mech. Syst. Signal Process. 2023, 195, 110315. [Google Scholar] [CrossRef]
  29. Fayed, H.A.; Atiya, A.F. Speed up grid-search for parameter selection of support vector machines. Appl. Soft Comput. 2019, 80, 202–210. [Google Scholar] [CrossRef]
  30. ASHRAE. Measurement of Energy, Demand, and Water Savings. ASHRAE Guideline 14–2023. 2023. Available online: https://technologyportal.ashrae.org/journal/articledetail/2473 (accessed on 6 January 2024).
Figure 1. Hourly heating/cooling load of the building.
Figure 2. Comprehensive framework of predictive model.
Figure 3. Contribution of candidate parameters to prediction accuracy.
Figure 4. Candidate parameter correlation heatmap.
Figure 5. Prediction accuracy for different combinations of input parameters.
Figure 6. Prediction accuracy at different training set sizes.
Figure 7. Training time for different training set sizes.
Figure 8. Scatter plot of XGBoost performance during actual deployment.
Figure 9. Performance during actual XGBoost deployments.
Table 1. Candidate parameters and their statistical metrics.
Data Name and Unit | Min | Max | Median | Average | Data Source
Wind sensor (level) | 0.0 | 7.0 | 2.0 | 1.6 | Roof of the building, outdoor
Heat pump electricity (kWh) | 0.0 | 69.6 | 3.2 | 10.8 | Ammeter
Weather types (-) | 1.0 | 10.0 | 1.0 | 2.3 | Weather forecast
Month (-) | 1 | 12 | - | - | Calendar
Day (-) | 1 | 31 | - | - | Calendar
Hour (-) | 0 | 23 | - | - | Calendar
Holiday (-) | 0 | 1 | - | - | Calendar
Ambient temperature (°C) | −11.7 | 39.1 | 19.9 | 16.8 | Roof of the building, outdoor
Ambient relative humidity (%) | 3.9 | 95.6 | 64.9 | 62.9 | Roof of the building, outdoor
Solar radiation sensor (W/m²) | 0.0 | 762.5 | 14.5 | 1140.7 | Roof of the building, outdoor
Air pressure (kPa) | 98.9 | 104.2 | 101.8 | 101.6 | Roof of the building, outdoor
Lighting electricity (kWh) | 0.0 | 27.8 | 6.6 | 9.8 | Ammeter
Elevator electricity (kWh) | 0.0 | 1.4 | 0.2 | 0.2 | Ammeter
Other electricity (kWh) | 0.0 | 16 | 2.8 | 3.2 | Ammeter
Wind direction (-) | 0.0 | 7.0 | 3.0 | 3.9 | Weather forecast
Rainfall (mm) | 0 | 38 | 0 | 0.1 | Weather forecast
Year (-) | 2018 | 2021 | - | - | Calendar
Week (-) | 1 | 7 | - | - | Calendar
Load (kW) | 0.0 | 87.6 | 3.2 | 10.67 | Chilled water temperature and flow sensors
Table 2. Load forecasting model input parameters.
Input Parameter | Range
Heat pump power consumption (kWh) | 0.0~69.6
Weather types (-) | 1~10
Wind (level) | 0~7
Month of the year (-) | 1~12
Day of the month (-) | 1~31
Hour of the day (-) | 0~23
Holiday (-) | 0, 1
Table 3. Different input selections.
Input No. | Description
I0 | With the selected parameters from Table 2
I1 | With all available data
I2 | With the top 10 contributing parameters from Figure 3
I3 | With parameters highly correlated with the target parameter
I4 | Without historical meteorological data
I5 | Without historical electric meter data
Table 4. Performance of four models in the test set and during actual deployment.
Model | MPR | MLP | SVR | XGBoost
Test set R2 | 0.76 | 0.90 | 0.79 | 0.96
Deployment set R2 | 0.68 | 0.84 | 0.73 | 0.86
R2 loss | 0.08 | 0.06 | 0.06 | 0.10
Test set MSE | 0.015 | 0.003 | 0.006 | 0.059
Deployment set MSE | 7.34 | 3.05 | 6.61 | 5.27
Deployment set CV(RMSE) | 17.4% | 11.2% | 16.5% | 14.8%
MSE loss | 7.325 | 3.047 | 6.604 | 5.211
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Qin, H.; Yu, Z.; Li, Z.; Li, H.; Zhang, Y. Nearly Zero-Energy Building Load Forecasts through the Competition of Four Machine Learning Techniques. Buildings 2024, 14, 147. https://doi.org/10.3390/buildings14010147

