Article

Thermal Load Prediction in Residential Buildings Using Interpretable Classification

Department of Civil and Environmental Engineering, Faculty of Engineering, University of Auckland, Auckland 1010, New Zealand
* Author to whom correspondence should be addressed.
Buildings 2024, 14(7), 1989; https://doi.org/10.3390/buildings14071989
Submission received: 2 March 2024 / Revised: 17 May 2024 / Accepted: 27 May 2024 / Published: 1 July 2024
(This article belongs to the Special Issue AI and Data Analytics for Energy-Efficient and Healthy Buildings)

Abstract

Energy efficiency is a critical aspect of engineering because of the monetary and environmental benefits it can bring. One aspect in particular, the prediction of heating and cooling loads, plays a significant role in reducing energy costs and in minimising the risks associated with climate change. Recently, data-driven approaches such as artificial intelligence (AI) and machine learning (ML) have provided cost-effective, high-quality solutions for predicting heating and cooling loads. However, few studies have focused on interpretable classifiers, which produce predictive systems that are not only reliable but also easy for stakeholders to understand. This research investigates the applicability of ML classification techniques to the prediction of the heating and cooling loads of residential buildings using a dataset containing variables such as roof area, building height, orientation, surface area, wall area, and glazing area distribution. Specifically, we sought to determine whether models that derive rules are competitive in performance with other classification techniques for assessing the energy efficiency of buildings, in particular the associated heating and cooling loads. To achieve this aim, several ML techniques, including k-nearest neighbor (kNN), Decision Tree (DT, C4.5), naive Bayes (NB), Neural Network (Nnet), Support Vector Machine (SVM), and Rule Induction (RI, Repeated Incremental Pruning to Produce Error Reduction (RIPPER)), were modelled and then evaluated on residential data using a range of evaluation measures such as recall, precision, and accuracy. The results show that most classification techniques generate models with good predictive power with respect to the heating or cooling loads, with better results achieved by interpretable classifiers such as Rule Induction (RI) and Decision Trees (DT).

1. Introduction

The energy performance of buildings (EPB) is an important consideration in architectural and civil engineering research as it impacts directly on the environment and affects the levels of energy wastage. One proposed definition of energy efficiency in terms of building design and construction is “less energy utilisation of buildings including commercial, manufacturing, and residential for cooling, heating, and running electricity to derive the same outcome” [1]. Recent statistics show that the energy consumption of buildings has increased steadily over the last few decades, suggesting benefits to the development of smart optimisation to manage heating, ventilation, and air conditioning (HVAC).
The heating load of a building is the amount of heat energy required to maintain a comfortable indoor temperature. This can be affected by the quality of the insulation, the outdoor temperature, the air exchange rate, etc. [2]. The cooling load of a building is the amount of heat energy that must be removed to maintain a comfortable indoor temperature during hot weather [3]. This can be affected by the extent of solar radiation, the building design, the outdoor climate, the rate of air infiltration, etc. Thus, predicting the thermal loads (both heating and cooling) is crucial for producing energy-efficient buildings.
This research focuses on thermal load prediction from a data analytics perspective for optimising the energy efficiency of residential buildings. An important aspect of data science involves the application of machine learning (ML) techniques to explore historical data related to building design in order to develop high-quality estimates for engineers [4]. Recently, scholars have used ML techniques to select the best features related to residential home energy efficiency [5,6]. Using the cost-effective solutions provided by ML techniques, engineers and designers can determine which building features can be optimised, based on precise predictions of the heating or cooling load of a building. This may involve a reduction in the amount of air leakage, an increase in the level of insulation, an adjustment of the orientation of the structure, or the inclusion of HVAC equipment, among others.
Few research studies have focused on developing interpretable ML for heating or cooling load prediction. A thermal load model built using interpretable ML can provide detailed knowledge in a simple, easy-to-understand format for engineers and other stakeholders. They are then able to exploit this knowledge to (a) determine how the model makes its predictions, and (b) identify which building features contribute significantly to those predictions [7].
The aim of this research is firstly to investigate whether ML classification techniques are suitable for predicting heating and cooling loads in the context of the energy consumption problem, and secondly to determine whether classification techniques based on rules are competitive in terms of their predictive power when compared with other ML techniques. The energy efficiency problem is considered by focusing specifically on the forecasting of the cooling or heating loads of residential buildings. We also investigate the correlations between energy-related features, such as orientation, roof area, surface area, glazing area, and relative compactness, and the cooling and heating load variables using the existing Tsanas and Xifara (2012) dataset [8]. These correlations are then used to help decide which features have the largest impact on the cooling and heating loads in terms of energy use.
The main research questions are:
  • Based on feature selection techniques, which features impact on thermal load prediction in residential buildings?
  • Are data-driven-classification techniques suitable for accurately predicting the heating or cooling loads in residential buildings?
  • Are rule-based classification methods competitive in terms of predicting the heating and cooling loads in residential buildings when compared with other classification methods?
The methodology proposed is a modified version of the Cross Industry Standard Process for Data Mining (CRISP-DM) methodology [9]. The proposed methodology consists of the following:
  (a) A secondary data analysis of the factors related to the energy efficiency of residential buildings.
  (b) Data-driven techniques to predict heating and cooling loads.
  (c) An experimental evaluation of the data-driven algorithms in (b) using the real data of (a), conducted to identify which approach performs best in terms of predicting the heating and cooling loads.
  (d) Performance measures calculated to assess the performance of the data-driven techniques in predicting heating and cooling loads, including the error rate and the predictive accuracy, among others.

2. Literature Review

Guo et al. (2023), mentioned previously, utilised machine learning techniques in conjunction with optimization strategies to investigate residential house energy efficiency [5]. Their approach involved two main steps: (a) utilizing RF-filtering to identify the most relevant variables, and (b) applying a classification algorithm to construct predictive models for heating and cooling. Specifically, a covariance matrix adaptation evolution strategy (CMA-ES) technique was used to optimise the input features, and four hybrid algorithms, namely, Random-LightGBM, CMA-ES-LightGBM, TPE-LightGBM, and Grid-LightGBM, were used to construct the predictive models. These were then tested using simulated data collected from residential buildings in the Hohhot region of Inner Mongolia, China. Subsequently, a comparison was made between these models and other machine learning methods, such as ensemble learners, based on a range of assessment measures. The findings of the study indicate that the TPE-LightGBM approach yielded models with significant correlations between the predicted and observed values, as well as relatively low values of RMSE, MAE, and MAPE when compared to the other models examined in the study.
The hybrid auto-ML framework proposed by Lu et al. (2023) [6] incorporates a combination of various categorization approaches. The framework employs a hyper-optimization method to improve the input variables and classification techniques for constructing models for enhancing energy efficiency in residential structures. A variety of machine learning methods, such as Logistic Regression (LR), naive Bayes, three variants of decision trees, Support Vector Machines (SVM), and XGBoost, were employed in the automated machine learning (auto-ML) process. These algorithms were then assessed using the Tsanas and Xifara (2012) dataset [8], consisting of eight variables and two class labels. The findings demonstrated enhanced predictive capability of the proposed framework in comparison to alternative machine learning techniques.
The problem of predicting the thermal loads of residential buildings using ensemble learners, with the aim of helping engineers design buildings with reduced energy consumption, has also been investigated: one ensemble learner used a decision tree as the base algorithm and another used least-squares boosting [10]. The researchers introduced an ensemble learning approach known as the SFLA-optimised regression tree ensemble (SRTE). They subsequently conducted a comparative analysis between the classification models generated by the ensemble learner and those produced by alternative algorithms, such as Gaussian process regression and stepwise regression. This evaluation was performed using the Tsanas and Xifara (2012) dataset [8], and various performance metrics were employed to assess the models’ effectiveness. The experimental findings demonstrated that the SRTE method was competitive when compared to the selected machine learning algorithms. This conclusion is supported by a range of performance indicators, such as the coefficient of determination and the mean squared error.
One study focused on the examination of thermal load prediction in smart buildings [11]. In this study, an artificial neural network (ANN) method was used in conjunction with feature selection techniques, the aim being to optimise energy consumption for cost reduction. The researchers utilised maximum relevance minimum redundancy (MRMR), ReliefF, and T-tests to identify the necessary features for the machine learning (ML) algorithm. These features were then employed to train the tri-layered neural network (TNN) algorithm to generate models for the prediction of thermal loads in buildings with Internet of Things (IoT) characteristics. The researchers utilised the Tsanas and Xifara (2012) dataset [8] and partitioned the dataset into a training subset including 70% of the data and a testing subset comprising the remaining 30%. During the training phase, five-fold cross validation was used in the classification models. The empirical findings indicated a strong correlation between the glazing area and the relative compactness of a building with respect to its thermal load and subsequent energy consumption.
AI-based techniques used to create an efficient home energy management system (HEMS) have been reviewed [12]. The authors directed their attention towards research that examined the issue of forecasting various parameters related to Home Energy Management Systems (HEMS), such as household consumption, energy load forecasting, load generation, and the estimation of energy bill pricing, among other relevant aspects. The review encompassed a range of studies published between 2019 and 2021. These studies employed empirical methodologies and utilised data-driven approaches to examine their influence on the predicted accuracy or efficiency of HEMS. The study compared several data-driven methodologies, including decision trees, statistical approaches, probability approaches, and artificial neural networks (ANN), with respect to their various performance metrics.
One study used ML techniques, including random forest and deep learning, to study the forecasting of thermal loads in 768 commercial buildings in China [13]. The authors initially approached the modelling of the data as a regression problem due to the numerical nature of the thermal load factors. Subsequently, they conducted an analysis on commercial building data for the spring season. This analysis involved considering various parameters such as the wall area, the relative density, the surface area, the roof area, the height, the area of insulation material and its distribution, and the building orientation. To evaluate the performance of the generated models, two techniques were employed: Pearson Correlation and Recursive Feature Elimination with an Extra Random Tree algorithm. These techniques were used to evaluate the performance of the models for each subset of features included. During each iteration of the feature analysis, one feature is systematically eliminated from the dataset to assess its impact on the performance of the derived model in relation to the previous iteration. The experimental findings obtained from the dataset demonstrate that random forest, Bagging, and extreme tree algorithms exhibit superior performance compared to classic machine learning methods, such as decision trees, k-nearest neighbors (kNN), and linear regression.
A wide range of machine learning-based methods for energy systems modelling has also been investigated [14]. These methods include deep learning, physics-based modelling, ANNs, SVM, and ensemble prediction techniques. Various datasets related to building energy consumption, thermal load forecasting, district heating systems, and occupancy detection were utilised. Results obtained from the data-driven techniques showed acceptable error performance in terms of RMSE for thermal load prediction, and the time required to generate the predictive models ranged from 1.3 to 2.9 min, which is acceptable given that predictions are derived on an hourly basis throughout the year. The study revealed that combining different machine learning algorithms with optimization and feature selection techniques contributes to the overall advancement of data-driven models for thermal load prediction.
An integration of Teaching Learning Based Optimization (TLBO) with ANN learning models to estimate cooling loads in residential buildings has been investigated [15]. The authors employed multivariable linear regression, Multilayer Perceptron (MLP), and Adaptive Neuro-Fuzzy Inference Systems (ANFIS) for office building cooling load prediction, utilizing datasets comprising building specifications and cooling loads. The results of the hybrid models (ANN with TLBO) highlighted the superior accuracy of TLBO-MLP, which achieved an R2 of 0.96446 and a lower RMSE than TLBO-ANFIS. This combination of TLBO with ANN models enhances prediction accuracy, offering a promising approach for energy conservation in smart buildings.
A study on predicting building thermal loads using a hybrid deep reinforcement learning ensemble optimization model has also been conducted [16]. The authors proposed several hybrid methodologies that integrate deep reinforcement learning with ensemble optimization to enhance energy-saving predictions of heat loads. Specifically, a genetic algorithm, the firefly algorithm, principal component analysis, and the minimum sum of squares of the combined prediction errors were integrated into ANN models such as back-propagation and ELMAN. The dataset used to evaluate the machine learning models included solar radiation and weather information. Experimental results indicated advances in predicting thermal loads with higher precision and energy-saving potential, with the hybrid learning models achieving low Mean Absolute Percentage Errors ranging between 5.95% and 7.05%.
A method based on the clustering technique for predicting energy efficiency in buildings has been proposed in the literature [17]. The methodology involves clustering the dataset using k-means to determine optimal clusters based on the silhouette coefficient. Then, an input-doubling method is applied for predicting within each cluster. The dataset used by the authors in the study consisted of 768 observations with 6 independent attributes and 2 dependent attributes related to heating and cooling load. The performance evaluation using Mean Absolute Error (MAE) and Mean Square Error (MSE) showed promising results. In particular, the proposed method demonstrated better accuracy compared to classic Support Vector Regression (SVR with rbf kernel) and a hierarchical predictor.
A recent study employed various data mining techniques to predict building heating and cooling loads, including supervised machine learning techniques such as random forest, support vector regression, and ANNs [18]. The study utilised a dataset comprising influential indicators such as weather conditions, building characteristics, and occupancy patterns. Results showed that the random forest method outperforms others in terms of RMSE, R-squared, and MAE metrics, indicating its effectiveness in predicting heating and cooling loads. This study provides valuable insights into factors influencing building energy usage, aiding in improving energy efficiency and reducing costs.
The predictive modelling of heating and cooling degree hour indexes for residential buildings based on outdoor air temperature variability has also been explored [19]. The author employed ANN and support vector regression and utilised hourly ambient air temperature data. The study also applied clustering techniques to investigate whether the accuracy of the models could be improved. Results show that ANN models outperformed support vector regression models, with significant improvements in predictive accuracy achieved after applying clustering. The best NN model achieved MSE = 0.645, MAE = 0.564, MAPE = 6.134, and R2 = 0.981 for heating degree hour (HDH) forecasting, and MSE = 0.555, MAE = 0.508, MAPE = 8.746, and R2 = 0.904 for cooling degree hour (CDH) prediction. Clustering enhances model precision, particularly in extreme temperature conditions, highlighting the potential of these models in optimizing HVAC system operations for thermal comfort and energy efficiency in buildings.
Support Vector Regression (SVR) models, including an SVR model optimised with the Coot optimization algorithm (SVCO), were used in a study to predict heating load consumption in residential buildings [20]. The dataset used in the study comprised building attributes, energy systems, weather conditions, and occupant behavior data. Experimental results demonstrated the SVCO model’s superior performance, reducing prediction errors by 20% to over 50%, achieving an R2 value of 0.992, and exhibiting a minimum Root Mean Square Error (RMSE) of 0.964. The study compares favorably with other published works, emphasizing the SVCO model’s accuracy in energy consumption forecasting and its potential for guiding energy-efficient interventions, contributing to sustainable building operations.
A previously mentioned study has been conducted on energy performance analysis in smart buildings using various optimization algorithms [5]. In this study, the authors employed methods including genetic algorithms, Particle Swarm Optimization, and newer approaches such as the Political Optimiser and the Heap-based Optimiser, among others. The analysis used a diverse dataset sourced from energy efficiency databases and simulation models. Results demonstrated improved accuracy, with all considered machine learning models achieving over 89% accuracy in predicting heating and cooling loads, thereby helping to optimise building energy performance and enhance indoor thermal comfort.
A recent study applied a probabilistic naive Bayes classification algorithm to estimate cooling load [21]. The author developed three hybrid probabilistic models: naive Bayes (NB), naive Bayes optimised with the Mountain Gazelle Optimiser (NBMG), and naive Bayes optimised with Horse Herd Optimization (NBHH). The study involved training, validation, and testing phases, with metrics such as RMSE, R2, and MSE used for evaluation. Results showed the superiority of the NBMG model when compared with the other probabilistic models, with an R2 of 0.986, an RMSE of 1.129 kW, and an MSE of 1.275 kW. Notably, NBMG reduced prediction errors by 20% on average, highlighting its potential for precise energy consumption forecasts.

3. Proposed Methodology

Figure 1 depicts the methodology used in the current study based on the energy efficiency dataset of Tsanas and Xifara (2012) [8]. Once the dataset was pre-processed, the independent variables were evaluated based on their correlation with each other and with the class, as discussed in Section 3.2. We sought to identify a small set of independent variables correlated with the ‘Heating Load’ or ‘Cooling Load’ variables, respectively. Lastly, the complete set of independent variables and the small set of correlated variables were used to train various ML algorithms to evaluate whether interpretable classifiers are competitive with classical ML algorithms in predicting heating and cooling loads.

3.1. Data Understanding

The dataset consists of 768 simulated buildings generated using the Ecotect simulation tool, with 10 different variables, two of which are dependent (the heating load and the cooling load). We created two copies of the dataset: one in which the heating load is the dependent variable (Heating dataset), and the other in which the cooling load is the dependent variable (Cooling dataset). According to the authors of the Tsanas and Xifara (2012) dataset [8], the original dataset contains 12 building forms, each comprising 18 elements. Each of the simulated buildings has different dimensions and surface areas but a similar volume; similar construction materials were used for the buildings’ elements. The simulated buildings were assumed to be residential buildings with a capacity of seven occupants and located in the city of Athens, Greece.
The dependent variables in the dataset are continuous (decimal variables) with a possible range after rounding of 6.01–43.10 kW, and 10.90–43.08 kW for the heating load and the cooling load, respectively. The independent variables in the dataset include relative compactness, surface area, wall area, roof area, overall height, orientation, glazing area, and glazing area distribution as continuous variables. Using independent variables to model two continuous variables is a regression problem because the target variables are continuous. However, since we would like to compare the performance of ML algorithms that deal with dependent variables with nominal values, the problem under consideration will be treated as a classification problem, consistent with conventional data science approaches using historical observational data.
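As a minimal sketch of this setup in Python (assuming the data are available locally as a headerless CSV file; the file name and column labels below are illustrative assumptions, not those of the original study), the two copies of the dataset could be prepared as follows:

```python
import pandas as pd

# Load the 768-building energy efficiency dataset (Tsanas and Xifara, 2012).
# The file name and column labels are assumptions used for illustration.
df = pd.read_csv("energy_efficiency.csv", header=None)
df.columns = [
    "Relative_compactness", "Surface_area", "Wall_area", "Roof_area",
    "Height", "Orientation", "Glazing_area", "Glazing_area_distribution",
    "Heating_Load", "Cooling_Load",
]

# Two working copies: one with the heating load as the dependent variable
# and one with the cooling load as the dependent variable.
heating_df = df.drop(columns=["Cooling_Load"])
cooling_df = df.drop(columns=["Heating_Load"])

print(heating_df.shape, cooling_df.shape)  # expected: (768, 9) (768, 9)
```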

3.2. Data Preprocessing

Comparing the performance of classification techniques on the dataset requires special treatment of the target variables. In particular, the target variables need to be discretised so that the ML techniques are able to train models on the dataset. Discretization is the process of converting a continuous variable with a large number of possible values into a categorical variable with a limited number of values [22]. We used the Binning technique, via an operator called ‘Discretize by Binning’ built into the RapidMiner tool [23], to pre-process the target variables and convert them into categorical variables. Using the Binning technique, the possible values of the Heating Load and Cooling Load target variables are first sorted and then divided into bins of equal size. Each possible value is then allocated to a bin based on the bin’s boundaries, which are either computed automatically or defined by the user; the number of bins is assigned by the end-user. The number of bins was set to three during the discretization process after several warm-up experiments and based on the available range of the target variables. In these experiments, the kernel naive Bayes algorithm previously proposed [24] was used to monitor the behaviour of the classification models in terms of predictive accuracy as the number of bins increased. The naive Bayes algorithm was chosen because of its efficiency during training and its ability to quickly produce models from the dataset [25].
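As an illustrative sketch of this step (not the RapidMiner operator itself), equal-width binning of a continuous load variable into three categories can be reproduced in Python with pandas; the bin labels are assumptions:

```python
import pandas as pd

def discretize_by_binning(series: pd.Series, n_bins: int = 3) -> pd.Series:
    """Equal-width binning, analogous in spirit to 'Discretize by Binning'.

    The value range is divided into n_bins intervals of equal width and each
    observation is labelled with the interval it falls into.
    """
    labels = [f"range{i + 1}" for i in range(n_bins)]
    return pd.cut(series, bins=n_bins, labels=labels)

# Example: discretize the heating load (continuing the earlier sketch).
heating_df["Heating_Load_Class"] = discretize_by_binning(heating_df["Heating_Load"])
print(heating_df["Heating_Load_Class"].value_counts())
```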
Warm-up experiments were conducted to choose the number of bins needed for the discretization process. The results are depicted in Figure 2, which shows the predictive accuracy (in %) achieved by the naive Bayes algorithm for different numbers of bins. As shown in Figure 2, the accuracy generally decreases as the number of bins increases. Based on this behaviour, and on the range of values of the target variables, a suitable number of bins lies between two and five. Taking the truncated average of this range, the number of bins for discretising the target variables and for experimenting with the ML algorithms was set to three to ensure a fair comparison.
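A sketch of such a warm-up experiment is shown below, using scikit-learn's Gaussian naive Bayes as a stand-in for RapidMiner's kernel naive Bayes operator (an assumption made for illustration only):

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X = heating_df.drop(columns=["Heating_Load", "Heating_Load_Class"])

# Sweep the number of equal-width bins and track naive Bayes accuracy.
for n_bins in range(2, 9):
    y = discretize_by_binning(heating_df["Heating_Load"], n_bins).astype(str)
    acc = cross_val_score(GaussianNB(), X, y, cv=10, scoring="accuracy").mean()
    print(f"{n_bins} bins: mean 10-fold accuracy = {acc:.3f}")
```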

3.3. Methods Used

Several ML algorithms, including naive Bayes, Decision Tree (the C4.5 algorithm), ANN (Neural Net), SVM (LibSVM), instance-based learning (kNN), and rule induction (RI), were used. These ML algorithms were chosen for the following reasons:
(1)
They use different mechanisms to develop the classification systems, and thus are useful for obtaining a comprehensive assessment of the performance of ML algorithms to address the energy efficiency problem.
(2)
Most of these algorithms have been used extensively by other scholars in data-driven energy efficiency research [26,27,28,29].
(3)
They are readily available in ML tools, including RapidMiner, so they do not need to be separately implemented.
The naive Bayes algorithm is a probabilistic method based on Bayes’ Theorem [25]. kNN is an instance-based learning algorithm that uses neighbouring data instances to predict the target class of any given test instance [5]. The algorithm uses distance functions to decide on the neighbours used during the classification process. In this research, the number of neighbours was set to five (k = 5) in order to obtain a fair prediction decision [29]. The RI algorithm considered, Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [30], uses a ‘separate and conquer’ learning approach to find simple, understandable rules in ‘If-Then’ format from the input dataset.
DT is a type of classification algorithm that constructs a tree from the input dataset. The DT algorithm generally uses a data split criterion to decide which node to choose while growing the tree. Usually, the independent variable that provides the best split according to a mathematical measure is chosen by the DT algorithm. We use a DT algorithm implemented in the RapidMiner ML tool with the information gain criterion. The DT implemented in this research is based on the C4.5 algorithm [31]. Neural Net is a feed-forward ANN that uses a back-propagation algorithm to build classification systems [32]. Finally, LibSVM is a library developed by Chang and Lin (2011) to support the SVM algorithm in classification and regression tasks [33]. The LibSVM algorithm uses a sequential minimal optimisation (SMO) algorithm for building classifiers.
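For readers working outside RapidMiner, a comparable set of classifiers could be instantiated in Python with scikit-learn, as sketched below. The correspondence is approximate: DecisionTreeClassifier with the entropy criterion approximates C4.5's information-gain splitting, SVC wraps LibSVM, and RIPPER has no scikit-learn implementation, so the rule induction step would require a third-party package or RapidMiner itself.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Classifiers roughly corresponding to those used in this study; exact
# equivalence with the RapidMiner operators is not claimed.
classifiers = {
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "DT (entropy split)": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "NB": GaussianNB(),
    "Nnet (feed-forward)": MLPClassifier(max_iter=1000, random_state=0),
    "SVM (LibSVM-based)": SVC(random_state=0),
}
```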
In the proposed methodology, an assessment of the energy efficiency variables in the dataset has been performed to identify (a) the independent variables that are similar in order to minimise redundancy in the dataset, and (b) a small set of energy efficiency variables that correlate more with the heating and cooling loads. Using correlation analysis, we aimed to identify a small subset of energy efficiency variables that are highly correlated with the target variables but less correlated with each other. To achieve this goal, we used Pearson Correlation Analysis, as per Equation (1). The Correlation Operator in RapidMiner enables us to generate a correlation matrix between variables.
r = \frac{\sum_{i}\left(X_i - \bar{X}\right)\left(T_i - \bar{T}\right)}{\sqrt{\sum_{i}\left(X_i - \bar{X}\right)^{2}}\,\sqrt{\sum_{i}\left(T_i - \bar{T}\right)^{2}}}    (1)
In this equation, X and T denote the two continuous variables, and r is the degree of correlation, ranging from −1 (a perfect negative correlation) to 1 (a perfect positive correlation). \bar{X} and \bar{T} are the means of the X and T values, respectively.
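In Python, a correlation matrix analogous to the output of RapidMiner's Correlation Operator can be obtained directly from pandas; a sketch, assuming the heating_df frame from the earlier snippets, is shown below:

```python
# Pearson correlation (Equation (1)) applied pairwise to every pair of
# continuous variables, including the Heating_Load target.
numeric_cols = heating_df.drop(columns=["Heating_Load_Class"], errors="ignore")
corr_matrix = numeric_cols.corr(method="pearson")

# Correlation of each independent variable with the heating load,
# ranked by absolute strength.
heating_corr = corr_matrix["Heating_Load"].drop("Heating_Load")
print(heating_corr.abs().sort_values(ascending=False))
```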
RapidMiner, a data science tool that provides implementations for data pre-processing, classification, regression, clustering, association, and visualisation, among others, is the platform used for testing the ML algorithms on the heating and cooling datasets [23]. All hyperparameters for the ML techniques, the discretization, and the correlation analysis were left at their default values.
Ten-fold cross validation (10F-CV) is used for testing the models of the ML algorithms in the RapidMiner tool 10.2.0. The testing technique 10F-CV first separates the training dataset into 10 folds, from which the ML algorithm uses 9 folds to develop the classification system, and tests the predictive power of the system on the remaining fold [22]. The process is repeated 10 times during which performance measurements, such as predictive accuracy, are computed. Finally, the accuracies computed in the 10 runs are averaged to estimate the system’s predictive effectiveness.
The performance measurements used to assess the ML models were predictive accuracy, precision, recall, and root mean squared error. Most of these measures were computed from the confusion matrix which depicts the correct classification and misclassifications for each target class variable in the dataset.
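A sketch of this evaluation loop in Python, combining 10-fold cross-validation with accuracy, precision, and recall (macro-averaged over the three load categories), is given below; the use of scikit-learn here is an assumption, as the study itself used RapidMiner, and the snippet continues the earlier sketches:

```python
from sklearn.model_selection import cross_validate

X = heating_df.drop(columns=["Heating_Load", "Heating_Load_Class"])
y = heating_df["Heating_Load_Class"].astype(str)

scoring = ["accuracy", "precision_macro", "recall_macro"]
for name, clf in classifiers.items():
    # 10-fold cross-validation: train on 9 folds, test on the remaining fold,
    # repeated 10 times; the reported figures are averages over the 10 runs.
    scores = cross_validate(clf, X, y, cv=10, scoring=scoring)
    print(f"{name}: "
          f"accuracy={scores['test_accuracy'].mean():.3f}, "
          f"precision={scores['test_precision_macro'].mean():.3f}, "
          f"recall={scores['test_recall_macro'].mean():.3f}")
```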

4. Results and Discussion

4.1. Feature Selection

For the heating dataset (Figure 3), four variables were found to have high positive correlations with the heating load: building ‘Height’, ‘Roof_area’, ‘Surface_area’, and ‘Wall_area’, all with correlations of 50% or more. The two variables most correlated with the Heating Load are building ‘Height’, with a correlation of more than 88%, and ‘Roof_area’, with a correlation of approximately 86%. A possible reason for this is that they are used when calculating the heating load. The ranking of the variables against the Heating dataset revealed that ‘Orientation’ and ‘Glazing_area_distribution’ are the variables least correlated with the Heating Load class. The results for the Cooling dataset are consistent with those obtained for the Heating dataset using the correlation analysis. The correlation results in Figure 4 show consistency in the ranking of the considered building variables against the target class variables (thermal loads).
The correlations between pairs of building variables were also measured, without considering the heating or cooling loads. The results show a positive correlation of 88% between the ’Roof_area’ and ‘Surface_area’ variables, which is relatively high, suggesting that only one of these variables needs to be used in the training of the classification model. Relative ‘Compactness’ and building ‘Height’ are highly correlated, since the height is a factor in calculating the relative compactness of the building (height being part of the building’s volume). Hence, only one of these two variables needs to be used during the training of the classification algorithm. Moreover, there are high negative correlations between ‘Height’ and the variables ’Roof_area’ and ‘Surface_area’, with correlations of −85% and −97%, respectively, suggesting that these variables should be retained in the input data (after removing similar variables).
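A sketch of this redundancy-reduction step, continuing the earlier snippets, is given below; the 0.85 threshold for flagging highly correlated pairs is an illustrative assumption rather than a value taken from the study:

```python
import numpy as np
import pandas as pd

# Absolute pairwise correlations between the independent variables only.
features = corr_matrix.drop(index="Heating_Load", columns="Heating_Load").abs()

# Keep only the upper triangle so that each pair is inspected once.
mask = np.triu(np.ones(features.shape, dtype=bool), k=1)
upper = features.where(mask)

# Flag pairs whose absolute correlation exceeds the chosen threshold.
threshold = 0.85
redundant_pairs = [
    (row, col, round(upper.loc[row, col], 2))
    for row in upper.index
    for col in upper.columns
    if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > threshold
]
print(redundant_pairs)  # e.g. highly correlated pairs such as Roof_area and Surface_area

# In this study, 'Roof_area' and 'Relative_compactness' were removed to form
# the reduced heating and cooling subsets.
heating_reduced_df = heating_df.drop(columns=["Roof_area", "Relative_compactness"])
cooling_reduced_df = cooling_df.drop(columns=["Roof_area", "Relative_compactness"])
```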

4.2. Classification

Several classification algorithms (i.e., KNN, DT, NB, SVM, Nnet, and RI) were evaluated using the heating and cooling datasets to estimate their predictive accuracy, recall, and precision rates. The aim was to determine whether classifiers that generate interpretable rules are competitive in terms of performance measurements relative to other conventional classification algorithms. The evaluation of the classification algorithms was conducted on the following data subsets based on the results of the feature selection phase:
(1)
Heating dataset: Consists of the complete set of variables in the heating dataset plus the Heating Load class. This includes relative compactness, surface area, wall area, roof area, overall height, orientation, glazed area, and glazing area distribution.
(2)
Heating_Reduced_Subset: Consists of fewer variables plus the Heating Load class after excluding ‘Roof_area’ and relative ‘compactness’. It includes surface area, wall area, overall height, orientation, glazing area, and the glazed area distribution.
(3)
Cooling dataset: Consists of the complete set of variables in the cooling dataset plus the Cooling Load class.
(4)
Cooling_Reduced_Subset: Consists of fewer variables plus the Cooling Load class after excluding ‘Roof_area’ and relative ‘compactness’.
The predictive accuracy, precision, and recall rates generated by the classification algorithms from the complete set of building variables in the heating and cooling datasets are depicted in Figure 5a–c, respectively. For the heating dataset, the performances of most of the classification algorithms were found to be acceptable to good, except when using the NB algorithm. Specifically, the DT and RI algorithms are the best-performing algorithms in terms of accuracy, precision, and recall rates. For instance, the RI algorithm developed classification models that outperformed the kNN, NB, SVM, and Nnet models, achieving 0.91%, 21.62%, 8.21%, and 3.91% higher accuracy rates, respectively. Only the DT algorithm, which also generates interpretable models once the tree is transformed into If-Then rules, developed models with a 3.25% higher accuracy than that of RI. The accuracy results demonstrate that rule-based classifiers are superior to the other ML algorithms, namely SVM, Nnet, instance-based learning (kNN), and probabilistic (NB) approaches, at least on the heating dataset.
The pattern of the precision and recall rates of the classification algorithms on the heating dataset is similar, in that the rule-based classifiers (DT, RI) consistently outperformed the other classification algorithms considered, followed by kNN and Nnet. According to the confusion matrix of the RI algorithm for the class values obtained after data discretization, 16, 10, and 8 test observations were misclassified for the three classes, respectively, during model evaluation. For the worst-performing classification algorithm (NB), there were 16, 0, and 184 misclassified test instances.
After performing discretization using three bins and then building the classification systems for the heating and cooling loads, NB showed good predictive performance in classifying data instances belonging to the first and second categories, achieving on average 83% accuracy or higher for both. However, for the third category, the number of data instances misclassified by NB was high, all wrongly assigned to the second category, in both the heating and cooling datasets. This result indicates that NB had more difficulty than the other classification algorithms in distinguishing between the second and third categories after the discretization process. This observation, albeit limited, suggests that the NB algorithm is the least applicable classification algorithm with respect to the heating dataset.
A possible reason for the superiority of rule-based classifiers such as RI and DT lies in the mechanisms used by these algorithms. When searching for patterns in the dataset, not only are these patterns derived, but they are also pruned in a way that reduces possible redundancy and expected misclassification errors. For example, the RIPPER algorithm prunes redundant and weakly predictive rules using optimisation procedures to ensure that only rules with general data coverage remain in the model. DT algorithms such as C4.5 use Pessimistic Error Estimation and other tree-pruning methods to trim rules by replacing sub-trees with appropriate leaves during tree expansion and after the tree is constructed. Another possible reason for the better classification decisions of the rule-based classifiers can be observed in their classification step, in which only rules that match the test data are used to assign class labels, which can minimise misclassification errors.
For the cooling dataset, the results are consistent with those generated by the classification algorithms from the heating dataset, except for NB. According to Figure 5a–c, four algorithms, i.e., DT, RI, kNN, and Nnet, developed models with accuracies of 93.0–93.43%, revealing that ML algorithms are suitable for predicting the cooling load in residential buildings. The other two classification algorithms (SVM, NB) also developed models with approximately 91% and 88% accuracies, respectively, also showing good predictive performance. The variation in the results in terms of accuracy, precision, and recall by the considered classification algorithms is less when the learning models from the cooling dataset are contrasted with models learned by the same algorithms from the heating dataset.
To assess the reduced set of variables after removing the ‘Roof_area’ and the relative ‘Compactness’ from the heating and cooling datasets, Table 1 shows the performance measurements generated by the considered classification techniques. The numbers highlighted in green, orange, and yellow text in Table 1 denote increases in performance, decreases in performance, and no change in performance, respectively. Based on the results obtained, it seems that no changes have occurred in terms of the predictive performance for the RI algorithm when learning from a reduced subset of variables based on both the heating and cooling datasets.
In addition, the DT models generated from the Heating_Reduced_Subset achieve a similar level of performance to those generated by the same algorithm from the complete set of variables. A very minor improvement in the performance measurements is observed when applying DT and NB to the reduced subset of the cooling data (Cooling_Reduced_Subset). A poorer performance, albeit by a very small amount, is observed when applying Nnet to the reduced set of variables in both the heating and cooling datasets. The results demonstrate that using a smaller subset of variables to develop classification models for predicting heating and cooling loads is comparable, in terms of performance measurements, to using the complete set of building variables. Hence, reducing redundancy by removing highly correlated variables is useful for the energy efficiency prediction application when using the datasets considered.
Overall, the results obtained using ML techniques suggest that there is considerable potential for their use in predicting heating or cooling loads. The classification models can be integrated into HVAC and smart home applications to help potential stakeholders, such as engineers and landlords, optimise energy use and reduce costs.

5. Conclusions

In assessing the energy efficiency of a dwelling, there are many challenges, including predicting the thermal loads, particularly the heating and cooling loads. Addressing this problem can help engineers to optimise heat utilisation and reduce long-term energy costs and environmental harm. An effective technology that can provide affordable, quality solutions from a historical data perspective is ML. Although the problem of predicting thermal loads has been studied by other researchers using statistical and conventional ML techniques, little research has focused on developing classification systems that are interpretable by stakeholders. To fill this gap, our research developed a data-driven methodology in which rule-based classification algorithms are evaluated on residential building data to measure their predictive performance for thermal loads while seeking model interpretability.
The empirical results on the dataset showed that the classification algorithms considered are suitable for heating and cooling load prediction, with the best results achieved by models developed using interpretable algorithms. In particular, the RI and DT algorithms derived not only easy-to-understand rules but also models that are competitive with respect to precision, recall, and predictive accuracy when compared with other ML algorithms. The classification models developed by RI provide simple knowledge that can help different stakeholders in the design and construction of energy-efficient buildings. These models could reveal potential areas for energy saving, such as increasing the amount of insulation, reducing air leakage, and improving the building orientation. They can also indicate which features contribute significantly to predicting thermal loads, such as the wall area, the roof area, the relative compactness, the overall height, the surface area, the glazing area distribution, and the orientation, among others. Furthermore, the feature analysis results showed that ’Roof_area’ and ‘Surface_area’ are influential features for thermal loads despite being highly correlated with each other.
One of the limitations of this research is that it does not include advanced AI techniques such as deep learning techniques. An area for future investigation is the development of a new interpretable classification algorithm for the thermal load prediction problem and to compare its performance with other existing ML techniques.

Author Contributions

Conceptualization, F.A.-J.; Methodology, F.A.-J. and K.N.D.; Software, F.A.-J.; Validation, F.A.-J. and K.N.D.; Formal analysis, F.A.-J. and K.N.D.; Investigation, F.A.-J.; Data curation, F.A.-J.; Writing—original draft, F.A.-J.; Writing—review & editing, K.N.D.; Visualization, K.N.D.; Supervision, K.N.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sekisov, A. Problems of achieving energy efficiency in residential low-rise housing construction within the framework of the resource-saving technologies use. In Proceedings of the E3S Web of Conferences, Strasbourg, France, 5–7 May 2021; EDP Sciences: Les Ulis, France, 2021; Volume 281, p. 06004. [Google Scholar]
  2. Designingbuildings. Available online: https://www.designingbuildings.co.uk/wiki/Heat_load_in_buildings (accessed on 31 May 2023).
  3. Rinaldi. 3 Types of Heating and Cooling Loads: Learn the Fundamentals. Rinaldi’s Air Conditioning & Heating. Available online: https://rinaldis.com/heating-and-cooling-loads/ (accessed on 3 April 2023).
  4. Sun, Y.; Haghighat, F.; Fung, B.C.M. A review of the-state-of-the-art in data-driven approaches for building energy prediction. Energy Build. 2020, 221, 110022. [Google Scholar] [CrossRef]
  5. Guo, J.; Yun, S.; Meng, Y.; He, N.; Ye, D.; Zhao, Z.; Jia, L.; Yang, L. Prediction of heating and cooling loads based on light gradient boosting machine algorithms. Build. Environ. 2023, 236, 110252. [Google Scholar] [CrossRef]
  6. Lu, C.; Li, S.; Penaka, S.R.; Olofsson, T. Automated machine learning-based framework of heating and cooling load prediction for quick residential building design. Energy 2023, 274, 127334. [Google Scholar] [CrossRef]
  7. Chen, Z.; Xiao, F.; Guo, F.; Yan, J. Interpretable machine learning for building energy management: A state-of-the-art review. Adv. Appl. Energy 2023, 9, 100123. [Google Scholar] [CrossRef]
  8. Tsanas, A.; Xifara, A. Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy Build. 2012, 49, 560–567. [Google Scholar] [CrossRef]
  9. Hotz, N. What Is CRISP DM? Data Science Process Alliance. Retrieved 30 March 2023. Available online: https://www.datascience-pm.com/crisp-dm-2/ (accessed on 15 April 2023).
  10. Pachauri, N.; Ahn, C.W. Regression tree ensemble learning-based prediction of the heating and cooling loads of residential buildings. Build. Simul. 2022, 15, 2003–2017. [Google Scholar] [CrossRef]
  11. Ghasemkhani, B.; Yilmaz, R.; Birant, D.; Kut, R.A. Machine Learning Models for the Prediction of Energy Consumption Based on Cooling and Heating Loads in Internet-of-Things-Based Smart Buildings. Symmetry 2022, 14, 1553. [Google Scholar] [CrossRef]
  12. Almughram, O.; Zafar, B.; Ben Slama, S. Home Energy Management Machine Learning Prediction Algorithms: A Review. In Proceedings of the 2nd International Conference on Industry 4.0 and Artificial Intelligence (ICIAI 2021), Hammamet, Tunisia, 28–30 November 2021; Atlantis Press: Amsterdam, The Netherlands, 2022; pp. 40–47. [Google Scholar]
  13. Liu, J.; Zeng, K.; Wang, H.; Du, B.; Tang, Y. Generalized Prediction of Commercial Buildings Cooling and Heating Load Based on Machine Learning Technology. In Proceedings of the 4th International Conference on Environmental, Industrial and Energy Engineering (EI2E 2020), IOP Conference Series: Earth and Environmental Science, Guiyang, China, 15–17 October 2020; Volume 610. [Google Scholar] [CrossRef]
  14. Fouladfar, M.H.; Soppelsa, A.; Nagpal, H.; Fedrizzi, R.; Franchini, G. Adaptive thermal load prediction in residential buildings using artificial neural networks. J. Build. Eng. 2023, 77, 107464. [Google Scholar] [CrossRef]
  15. Zheng, S.; Xu, H.; Mukhtar, A.; Hizam Md Yasir, A.S.; Khalilpoor, N. Estimating residential buildings’ energy usage utilising a combination of Teaching–Learning–Based Optimization (TLBO) method with conventional prediction techniques. Eng. Appl. Comput. Fluid Mech. 2023, 17, 2276347. [Google Scholar] [CrossRef]
  16. An, W.; Zhu, X.; Yang, K.; Kim, M.K.; Liu, J. Hourly Heat Load Prediction for Residential Buildings Based on Multiple Combination Models: A Comparative Study. Buildings 2023, 13, 2340. [Google Scholar] [CrossRef]
  17. Izonin, I.; Tkachenko, R.; Mitoulis, S.A.; Faramarzi, A.; Tsmots, I.; Mashtalir, D. Machine learning for predicting energy efficiency of buildings: A small data approach. Procedia Comput. Sci. 2024, 231, 72–77. [Google Scholar] [CrossRef]
  18. Mehdizadeh Khorrami, B.; Soleimani, A.; Pinnarelli, A.; Brusco, G.; Vizza, P. Forecasting heating and cooling loads in residential buildings using machine learning: A comparative study of techniques and influential indicators. Asian J. Civ. Eng. 2024, 25, 1163–1177. [Google Scholar] [CrossRef]
  19. Kajewska-Szkudlarek, J. Predictive modelling of heating and cooling degree hour indexes for residential buildings based on outdoor air temperature variability. Sci. Rep. 2023, 13, 17411. [Google Scholar] [CrossRef] [PubMed]
  20. Wang, C.; Qiu, X. Estimation of Heating Load Consumption in Residual Buildings using Optimized Regression Models Based on Support Vector Machine. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 1019. [Google Scholar] [CrossRef]
  21. Xu, Y. Research on cooling load estimation through optimal hybrid models based on Naive Bayes. J. Eng. Appl. Sci. 2024, 71, 75. [Google Scholar] [CrossRef]
  22. Han, J.; Pei, J.; Tong, H. Data Mining: Concepts and Techniques; Morgan Kaufmann: Burlington, MA, USA, 2022. [Google Scholar]
  23. Mierswa, I.; Klinkenberg, R. RapidMiner Studio. RapidMiner Account, 9.1.000 (Rev: ef0090, Platform OSX), RapidMiner, Inc., 12 December 2018, rapidminer.com. Available online: https://my.rapidminer.com/nexus/account/index.html (accessed on 20 April 2023).
  24. Pérez, A.; Larrañaga, P.; Inza, I. Bayesian classifiers based on kernel density estimation: Flexible classifiers. Int. J. Approx. Reason. 2009, 50, 341–362. [Google Scholar] [CrossRef]
  25. Murakami, Y.; Mizuguchi, K. Applying the Naïve Bayes classifier with kernel density estimation to the prediction of protein–protein interaction sites. Bioinformatics 2010, 26, 1841–1848. [Google Scholar] [CrossRef]
  26. Moradzadeh, A.; Mohammadi-Ivatloo, B.; Abapour, M.; Anvari-Moghaddam, A.; Roy, S.S. Heating and Cooling Loads Forecasting for Residential Buildings Based on Hybrid Machine Learning Applications: A Comprehensive Review and Comparative Analysis. IEEE Access 2021, 10, 2196–2215. [Google Scholar] [CrossRef]
  27. Abdou, N.; El Mghouchi, Y.; Jraida, K.; Hamdaoui, S.; Hajou, A.; Mouqallid, M. Prediction and optimization of heating and cooling loads for low energy buildings in Morocco: An application of hybrid machine learning methods. J. Build. Eng. 2022, 61, 10533. [Google Scholar] [CrossRef]
  28. Prasetiyo, B.; Alamsyah, A.; Muslim, M.A. Analysis of building energy efficiency dataset using naive bayes classification classifier. In Proceedings of the CMSE2018: 5. International Conference on Mathematics, Science and Education 2018, Kuta, Indonesia, 8–9 October 2018; International Atomic Energy Agency (IAEA): Vienna, Austria, 2019. [Google Scholar] [CrossRef]
  29. Batista, G.E.A.P.A.; Silva, D.F. How k-nearest neighbor parameters affect its performance. In Proceedings of the Argentine Symposium on Artificial Intelligence, Mar del Plata, Argentina, 24–25 August 2009; pp. 1–12. [Google Scholar]
  30. Cohen, W.W. Repeated incremental pruning to produce error reduction. In Proceedings of the Machine Learning Proceedings of the Twelfth International Conference ML95, Tahoe City, CA, USA, 9–12 July 1995. [Google Scholar]
  31. Quinlan, J.R. C4.5: Programs for Machine Learning; Elsevier: Amsterdam, The Netherlands, 2014. [Google Scholar]
  32. Kelley, H.J. Gradient theory of optimal flight paths. Ars J. 1960, 30, 947–954. [Google Scholar] [CrossRef]
  33. Chang, C.C.; Lin, C.J. LIBSVM: A Library for Support Vector Machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 27. [Google Scholar]
Figure 1. Methodology Used.
Figure 2. Accuracy of naive Bayes with Kernel Models from the Heating and Cooling Load Datasets vs. Number of Bins.
Figure 3. Weights of the independent variables in the heating dataset using a ‘Weight by Correlation’ operator.
Figure 4. Weights of the independent variables in the cooling dataset using a ‘Weight by Correlation’ operator.
Figure 5. (a) Accuracy of the ML Algorithms on the Heating and Cooling Datasets; (b) Precision of the ML algorithms on the Heating and Cooling Datasets; (c) Recall of the ML Algorithms on the Heating and Cooling Datasets.
Table 1. Performance Measures Generated by the Classification Algorithms from the Reduced Heating and Cooling Data Subsets (accuracy, recall, and precision in %).

Algorithm | Accuracy (Heating) | Accuracy (Cooling) | Recall (Heating) | Recall (Cooling) | Precision (Heating) | Precision (Cooling) | RMSE (Heating) | RMSE (Cooling)
kNN       | 96.62 | 93.36 | 96.62 | 91.02 | 96.01 | 90.19 | 0.16 | 0.23
DT        | 98.31 | 93.49 | 98.15 | 89.78 | 97.93 | 91.20 | 0.11 | 0.21
NB        | 73.96 | 90.76 | 65.87 | 85.28 | 49.32 | 86.90 | 0.51 | 0.29
SVM       | 87.38 | 91.02 | 84.12 | 87.41 | 84.18 | 86.93 | 0.35 | 0.30
Nnet      | 90.49 | 92.44 | 88.07 | 86.16 | 90.02 | 92.22 | 0.27 | 0.24
RI        | 95.58 | 93.23 | 94.84 | 89.27 | 95.55 | 91.44 | 0.21 | 0.24