CRISP-DM-Based Data-Driven Approach for Building Energy Prediction Utilizing Indoor and Environmental Factors

Elkabalawy, Moaaz; Al-Sakkaf, Abobakr; Mohammed Abdelkader, Eslam; Alfalah, Ghasan

doi:10.3390/su16177249

by Moaaz Elkabalawy ¹,*, Abobakr Al-Sakkaf ¹,*, Eslam Mohammed Abdelkader ² and Ghasan Alfalah ³

¹ Department of Building, Civil, and Environmental Engineering, Concordia University, Montreal, QC H3G 1M8, Canada
² Structural Engineering Department, Faculty of Engineering, Cairo University, Giza 12613, Egypt
³ Department of Architecture and Building Science, College of Architecture and Planning, King Saud University, Riyadh 11362, Saudi Arabia
* Authors to whom correspondence should be addressed.

Sustainability 2024, 16(17), 7249; https://doi.org/10.3390/su16177249 (registering DOI)

Submission received: 4 July 2024 / Revised: 5 August 2024 / Accepted: 20 August 2024 / Published: 23 August 2024

Abstract

The significant energy consumption associated with the built environment demands comprehensive energy prediction modelling. Leveraging their ability to capture intricate patterns without extensive domain knowledge, supervised data-driven approaches present a marked advantage in adaptability over traditional physical-based building energy models. This study employs various machine learning models to predict energy consumption for an office building in Berkeley, California. To enhance the accuracy of these predictions, different feature selection techniques, including principal component analysis (PCA), decision tree regression (DTR), and Pearson correlation analysis, were adopted to identify key attributes of energy consumption and address collinearity. The analyses yielded nine influential attributes: heating, ventilation, and air conditioning (HVAC) system operating parameters, indoor and outdoor environmental parameters, and occupancy. To overcome missing occupancy data in the datasets, we investigated the possibility of Wi-Fi-based occupancy prediction using different machine learning algorithms. The results of the occupancy prediction modelling indicate that Wi-Fi can be used with acceptable accuracy in predicting occupancy count, which can be leveraged to analyze occupant comfort and enhance the accuracy of building energy models. Six machine learning models were tested for energy prediction using two different datasets: one before and one after occupancy prediction. Using a 10-fold cross-validation with an 8:2 training-to-testing ratio, the Random Forest algorithm emerged superior, exhibiting the highest R² value of 0.92 and the lowest RMSE of 3.78 when occupancy data were included. Additionally, an error propagation analysis was conducted to assess the impact of the Wi-Fi-based occupancy prediction model’s error on the energy prediction model. The results indicated that Wi-Fi-based occupancy prediction can improve the data inputs for building energy models, leading to more accurate energy consumption predictions. The findings underscore the potential of integrating the developed energy prediction models with fault detection systems, model predictive controllers, and energy load shape analysis, ultimately enhancing energy management practices.

Keywords:

smart buildings; sustainability; data mining; data-driven energy prediction models; CRISP-DM; supervised machine learning; occupancy prediction; feature selection analysis

1. Introduction

Global energy consumption has increased 100% in the past 40 years [1]. This surge in energy demand and the escalating energy crisis has necessitated shifting lifestyle practices towards more energy-efficient behaviours. Notably, the average individual spends upwards of 90% of their time indoors, positioning buildings as the predominant consumers of energy on a global scale [2]. In the context of the United States, the building sector accounts for more than 41% of primary energy consumption, surpassing the energy use in both the industrial (30%) and transportation sectors (29%) [3]. Similarly, residential buildings in Saudi Arabia consume about 50% of the nation’s electricity, primarily due to air conditioning, prompting initiatives like the Saudi Energy Efficiency Program (SEEP) to enhance efficiency and increase renewable energy use [4]. In Europe, primary energy consumption was 1257 million tonnes of oil equivalent (Mtoe) in 2022, with a 2.8% decrease in energy consumption to 940 Mtoe. Renewable energy sources accounted for 21.8% of the energy mix, highlighting the region’s efforts to enhance energy efficiency [5].

Consequently, improving building energy efficiency is a global priority, essential for conserving energy and reducing carbon emissions. The significance of building energy prediction in this endeavour cannot be overstated, as it plays a pivotal role in the implementation of energy efficiency measures, including demand response control [6], system fault detection and diagnosis [7], energy benchmarking [8], and measurement and verification of building systems [9]. Given the complexity of factors influencing building energy use, including environmental conditions, operational data, and occupancy, accurately predicting energy consumption presents a significant challenge. This study aims to refine predictive models for building energy consumption by investigating the impact of various physical and environmental attributes, focusing on leveraging Wi-Fi data to predict occupancy. By integrating diverse data sources, this comprehensive approach has the potential to significantly enhance the accuracy of energy consumption predictions and provide a robust basis for developing effective energy management strategies, offering a promising future for energy efficiency.

Engineering-based building energy modelling, a prevalent method for predicting building energy usage, employs physical principles to assess the thermal dynamics and energy behaviour of buildings [10]. This methodology not only aids in the design of energy-efficient buildings [11] but also facilitates the widespread adoption of numerous building energy modelling tools globally, attributed to its straightforwardness and effectiveness [12]. However, its application is predominantly during the building design phase due to its comprehensive review of building details. The requirement for detailed information on building geometry, material specifications, heating, ventilation, and air conditioning (HVAC) system configurations, as well as lighting specifications, renders it less feasible for existing buildings [10]. Furthermore, discrepancies between predicted and actual energy consumption have been observed in recent studies, indicating limitations in this approach [11,12].

An alternative, the empirical modelling approach, has gained popularity for its effectiveness in predicting building energy consumption over the past two decades, thanks to its superior implementation ease and accuracy [13,14]. This method leverages machine learning algorithms, including decision trees (DT) [15], artificial neural networks (ANN) [16,17,18], Gaussian processes regression [19,20], K-nearest neighbours [21], support vector machines (SVM) [22,23,24], gradient boosting trees [25], adaptive neuro-fuzzy inference systems [26] and long short-term memory networks [27,28] to establish generalizable relationships between the input and output data. Empirical modelling is particularly advantageous for the energy prediction of existing buildings, as it relies on readily available data such as building energy consumption, environmental conditions, and occupancy information.

2. Background Literature

The precise selection of inputs is essential for developing accurate building energy load prediction models. Recent scholarly efforts have focused on refining the input selection process to enhance prediction accuracy [29,30,31,32]. For instance, Pearson correlation analysis and principal component analysis were utilized to select the model inputs, resulting in a minimum 4% increase in prediction accuracy [29,31]. Furthermore, Sholahudin and Han [33] employed Analysis of Variance (ANOVA) to determine the most impactful factors on heating load among variables such as outdoor dry-bulb temperature, dew point temperature, direct average radiation, diffuse horizontal radiation, and wind speed. Their analysis indicated that dry-bulb temperature and wind speed significantly affect heating load, justifying their inclusion as model inputs. Similarly, Kapetanakis et al. [34] employed Pearson and Spearman’s correlation coefficients to investigate the relationship between building load and environmental factors, including ambient temperature and humidity, wind speed, solar radiation, and indoor air conditions, across diverse climates. They established a threshold correlation value of 0.5 for incorporating a variable into their predictive model, comparing model performance using a comprehensive versus a selected set of variables to evaluate accuracy. In another attempt, Cui et al. [35] deployed the Shapley additive explanations (SHAP) method to determine the influencing factors on the energy consumption of households. They determined that total square footage, dry-bulb design temperature, and heating space with natural gas had the highest impact on energy consumption. Fan et al. [36] explored various feature extraction methods, including engineering, statistical, structural approaches, and unsupervised deep learning, to select inputs for predicting the cooling load.
They notably considered the influence of historical data, observing significant prediction improvements with features extracted via unsupervised deep learning. Although these studies contribute valuable insights into input selection for energy load prediction models, limitations remain. Most notably, the focus has been predominantly on external variables, overlooking the influence of indoor factors such as occupancy, which are crucial for accurately predicting heating and cooling loads. Additionally, the potential for multicollinearity among the selected inputs underscores the need for a meticulous selection process to avoid redundancy, which could detract from model accuracy and computational efficiency.

The inclusion of occupancy data in building energy consumption prediction models is of paramount importance for enhancing their accuracy and relevance. Accurate occupancy information provides a critical foundation for understanding and forecasting the energy demands of a building, as occupancy levels directly influence the operational energy requirements. This is because the presence or absence of occupants affects heating, cooling, lighting, and the use of electrical appliances, all of which are significant contributors to a building’s overall energy profile [37,38]. However, the acquisition of precise occupancy data is fraught with challenges. Privacy concerns stand out as a primary hurdle, as collecting data related to the presence of individuals within a space can raise significant ethical and legal issues [39]. Beyond privacy, logistical complexities also present a significant barrier. The dynamic nature of occupancy, with individuals moving in and out of spaces throughout the day, requires robust and continuous monitoring systems to capture accurate data [40].
Moreover, the inherent variability in occupancy patterns, influenced by factors such as time of day, day of the week, and seasonal changes, adds another layer of complexity to the data collection process [41].

Occupancy detection can be achieved by monitoring indoor carbon dioxide (CO₂) levels or employing passive infrared (PIR) sensors in building energy management. CO₂ sensors are commonly utilized within demand-controlled ventilation systems, where the provision of ventilation or external air to an indoor environment is regulated based on the measured CO₂ parts per million (ppm) levels. However, it has been observed that the response time of CO₂ sensor-based detection methods is too slow for application in commercial buildings [42]. In contrast, PIR sensors are predominantly used to automate lighting controls based on occupancy. Research has indicated that occupancy-driven controls can reduce lighting energy consumption by up to 30% [43]. Nonetheless, PIR sensors depend on a clear line of sight for accurate motion detection, making them vulnerable to false negatives, such as the inadvertent switching off of lights despite the presence of occupants [44]. Further, electricity and water consumption data analysis has been introduced as an alternative approach for more efficient occupancy inference [45]. In commercial, non-residential settings, the analysis of Wi-Fi activity presents a novel opportunity for occupancy monitoring without needing modifications to the existing building infrastructure, apart from the requisite data collection and processing. The utility of Wi-Fi data in detecting and predicting building occupancy has been validated through various case studies [46,47,48]. While overall building occupancy measurement accuracy is high, the precision at the floor and room level is significantly lower, attributed to devices connecting to Wi-Fi access points not located near the user [49].
Additionally, a positive correlation has been documented between the Wi-Fi connection counts and trends in building electricity consumption, suggesting these metrics can partially elucidate the observed energy consumption patterns [50]. The application of Wi-Fi signals as implicit sensors for controlling HVAC and lighting systems has also been explored. Ref. [51] reported that occupancy-based control systems based on Wi-Fi signals could achieve more than 90% and 80% energy savings for lighting compared to static schedules and PIR-based controls, respectively. Regarding HVAC systems, Balaji et al. [52] demonstrated a potential for 17% energy savings by utilizing the existing Wi-Fi infrastructure for system actuation. W. Wang et al. [53] introduced a framework for an occupancy-linked energy-cyber-physical system that incorporates occupancy data through the active scanning of Wi-Fi connection requests and responses, achieving approximately 26% savings in energy consumption for cooling and ventilation.

In response to the escalating global energy demand and the subsequent energy crisis, there is a pressing need to enhance energy efficiency, particularly within the building sector, a major energy consumer worldwide. Recognizing that buildings account for a significant portion of energy use, this study aims to refine predictive models for building energy consumption. Such models are crucial for implementing effective energy-saving measures, fault detection and diagnosis, and reducing carbon emissions. Given the complexity of factors influencing building energy use, including environmental conditions, operational data, and occupancy, accurately predicting energy consumption presents a significant challenge. This challenge underscores the importance of developing models incorporating a wide range of data and reflecting the dynamic nature of occupancy and its impact on energy usage. With this context in mind, the study is guided by the following objectives, structured according to the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology:

Business Understanding and Data Understanding: Identify key physical attributes influencing energy consumption, acknowledging that specific building characteristics significantly affect energy use, and review previous energy prediction models.
Data Preparation and Modelling: Investigate different machine learning algorithms on Wi-Fi connection data to predict occupancy, aiming to enrich the dataset and thereby improve the accuracy of energy prediction models.
Modelling and Evaluation: Predict the total energy consumption by integrating diverse data sources, including environmental conditions and building operational data. This comprehensive approach aims to capture the multifaceted nature of energy consumption in buildings.
Evaluation and Deployment: Examine the influence of predicted occupancy data on the accuracy of energy consumption predictions. This analysis is critical for assessing the added value of occupancy data and enhancing the precision of predictive models, thereby improving the basis for energy management strategies.

This study distinguishes itself from previous research by exploring the use of Wi-Fi data as a proxy for occupancy, leveraging existing building infrastructure to enhance the accuracy of energy consumption models. Unlike many studies that primarily focus on outdoor variables, this research incorporates indoor and outdoor parameters, including but not limited to HVAC operating data and occupancy count, providing a more robust and comprehensive energy prediction model. Additionally, this study performs an iterative feature selection approach to refine the selection of indoor and outdoor variables, addressing the issue of multi-collinearity and redundancies that could impair the prediction models’ accuracy and computational speed. Furthermore, following the CRISP-DM methodology, this research involves thorough data understanding and preparation phases, evaluates six machine learning models using robust metrics, and employs 10-fold cross-validation to ensure the models are accurate and generalizable. The impact of accurately predicting occupancy on energy consumption models is empirically demonstrated, providing strong support for incorporating dynamic occupancy data into energy management strategies.

3. Research Framework

The present study employs the CRISP-DM methodology to systematically enhance predictive models for building energy consumption, as demonstrated in Figure 1. CRISP-DM is a comprehensive, standardized approach to data mining projects, consisting of six primary phases: business understanding, data understanding, data preparation, modelling, evaluation, and deployment. CRISP-DM is widely recognized for its structured yet flexible framework, making it particularly suitable for data-driven improvements in energy management and Industry 4.0 applications. Despite being introduced over two decades ago, CRISP-DM remains the de facto standard due to its reliability, flexibility, and ability to generate explainable data and robust models [54].

Step 1: Business Understanding

The first phase of CRISP-DM focuses on comprehending the project’s objectives and requirements from a business perspective. This involves defining the problem, setting specific goals, and assessing the current situation to establish the project’s scope and constraints.

The core problem is the high energy usage in buildings, which leads to significant costs and carbon emissions. Improving energy efficiency through accurate prediction models is crucial. Traditional methods often fail to capture the complex interactions between factors like occupancy patterns, HVAC operations, and environmental conditions. Setting specific goals includes identifying the key attributes influencing energy consumption, such as environmental parameters, HVAC conditions, and occupancy counts. A primary objective is to enhance prediction accuracy using Wi-Fi data for occupancy, as these data are crucial for optimizing heating, cooling, and lighting. Assessing the current situation involves analyzing existing building energy data, recognizing the limitations of current models, and understanding the potential benefits of improved predictions. Current models often lack integration of dynamic occupancy data, leading to less effective energy management. By addressing these limitations, the study aims to provide more accurate and actionable insights, leading to better energy management strategies, reduced energy wastage, and improved occupant comfort.

Step 2: Data Understanding

The data understanding phase involves collecting initial data and familiarizing oneself with it. This includes data collection, exploratory data analysis (EDA), and assessing data quality. By exploring the data and identifying potential issues, researchers can develop initial insights and form hypotheses about the relationships within the data. This step is crucial for understanding the data’s structure, quality, and potential for addressing the business problem. Data collection gathers comprehensive information on building energy consumption, environmental conditions, and Wi-Fi connection data to predict occupancy. Exploratory data analysis (EDA) techniques explore the data, identify patterns, and generate hypotheses. This includes using visualization and statistical analysis to reveal trends and correlations. Finally, data quality assessment involves checking for missing values, outliers, and inconsistencies in the data. Understanding these aspects ensures that the data are reliable and suitable for developing accurate predictive models.

Step 3: Data Preparation

Data preparation is often the most time-consuming phase, as it involves cleaning and transforming the raw data into a format suitable for modelling. This process ensures that the data are accurate, consistent, and ready for analysis. Key activities in this phase include data cleaning, data transformation, and feature selection. Data cleaning addresses issues identified during the data understanding phase, such as missing values, outliers, and inconsistencies. This step ensures the dataset is complete and reliable, which is crucial for developing robust predictive models. Data transformation involves normalizing, aggregating, and encoding the data to ensure it is in a suitable format for modelling. Normalization scales the data to a consistent range, aggregation summarizes the data at a higher level, and encoding converts categorical variables into numerical format. These transformations facilitate the application of machine learning algorithms. Feature selection identifies the most relevant attributes for predictive modelling. Techniques such as Pearson correlation analysis and principal component analysis (PCA) are used to determine which features significantly impact energy consumption. By thoroughly preparing the data, we ensure the subsequent modelling phase is based on high-quality, relevant data, leading to more accurate and reliable predictions.
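As a rough illustration of the normalization and correlation-based screening described above, the following sketch uses only the Python standard library. The feature names, the toy values, and the 0.9 collinearity threshold are assumptions made for illustration; they are not taken from this study, and the greedy filter is one possible way to address collinearity rather than the procedure used here.

```python
import math

def min_max_scale(values):
    """Scale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def drop_collinear(features, target, threshold=0.9):
    """Rank features by |r| with the target, then keep a feature only if it
    is not strongly correlated with a feature that was already kept."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], target)),
                    reverse=True)
    kept = []
    for name in ranked:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data: two temperature columns carry the same information.
features = {
    "outdoor_temp_c": [10, 12, 15, 18, 21, 25],
    "outdoor_temp_f": [50.0, 53.6, 59.0, 64.4, 69.8, 77.0],
    "occupancy": [5, 40, 35, 42, 8, 30],
}
energy = [20, 45, 44, 52, 33, 55]
selected = drop_collinear(features, energy)
```

In this toy case only one of the two temperature columns survives, alongside occupancy, because the Celsius and Fahrenheit series are perfectly correlated.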

Step 4: Modelling

In the modelling phase, various machine learning algorithms are applied to the prepared data to develop predictive models. This phase involves selecting appropriate algorithms, training the models, performing initial validation, and optimizing hyperparameters to ensure the models learn the patterns within the data effectively. The dataset was first split into 80% for training and 20% for testing to ensure the models could be evaluated on unseen data. The choice of algorithms, such as decision trees, artificial neural networks (ANN), and support vector machines (SVM), depends on the specific problem and the dataset’s characteristics. Hyperparameter optimization is a crucial part of the modelling phase. Techniques like grid search are used to systematically test different combinations of hyperparameters to find the optimal set that maximizes model performance. By fine-tuning these parameters, the model’s accuracy and robustness are enhanced.
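The 80/20 split and grid search can be sketched as follows. This is a minimal, standard-library illustration: the k-nearest-neighbour regressor is a stand-in chosen for brevity, and the toy data, the random seeds, the candidate grid, and the extra validation split carved out of the training data are all assumptions rather than details reported in this study.

```python
import math
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle rows and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def knn_predict(train, x, k):
    """Predict y at x as the mean target of the k nearest training rows."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def rmse(train, held_out, k):
    """Root mean square error of the k-NN model on held-out rows."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in held_out]
    return math.sqrt(sum(errors) / len(errors))

def grid_search(train, validation, k_grid):
    """Score every candidate k and return the best one with all scores."""
    scores = {k: rmse(train, validation, k) for k in k_grid}
    best_k = min(scores, key=scores.get)
    return best_k, scores

# Hypothetical toy data: (predictor, energy) pairs with a noisy linear trend.
rng = random.Random(0)
rows = [(t, 2.0 * t + rng.gauss(0, 1)) for t in range(100)]

train, test = train_test_split(rows)                       # 80% / 20%
fit, validation = train_test_split(train, test_ratio=0.25, seed=1)
best_k, scores = grid_search(fit, validation, k_grid=[1, 3, 5, 9])
test_rmse = rmse(fit, test, best_k)                        # unseen data only
```

Tuning on a validation subset and scoring once on the test set keeps the reported test error free of hyperparameter selection effects.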

Step 5: Evaluation

The evaluation phase is crucial for assessing the performance of the machine learning models to ensure they meet the desired accuracy and reliability. Multiple performance measurements are applied, starting with model testing using metrics such as the coefficient of determination (R²) and Root Mean Square Error (RMSE). R² indicates how well the model’s predictions match the actual data, with values closer to 1 reflecting better performance. It represents the proportion of the variance in the dependent variable that is predictable from the independent variables. RMSE measures the average magnitude of the errors between predicted and actual values. It is the square root of the average squared differences between predicted and observed values. A lower RMSE value indicates a better fit of the model, implying minor differences between the predicted and observed values. To further ensure model accuracy, residual plots are used to check if the errors (residuals) are normally distributed around a mean of zero. This indicates unbiased predictions and random error distribution, which are signs of good model performance. Residual plots also help diagnose potential issues like non-linearity, heteroscedasticity, or outliers. Additionally, a 10-fold cross-validation is performed for all models to ensure they do not memorize or overfit the data. Cross-validation involves partitioning the data into ten equal parts, training the model on nine parts, and testing it on the remaining part. This process is repeated ten times, each part used once as the test set, providing a robust evaluation of the model’s performance. By combining these performance measurements and validation techniques, the study ensures a comprehensive assessment of the machine learning models, leading to more reliable and accurate predictions of building energy consumption. This structured approach ensures that each phase contributes effectively to the overall goal of accurate energy consumption prediction.
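The two metrics and the fold construction described above can be written out in a few lines. The sketch below is a plain-Python illustration with made-up numbers; it builds contiguous, unshuffled folds, which is an assumption, since the text does not state how the folds were drawn.

```python
import math

def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def rmse(actual, predicted):
    """Square root of the mean squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs so that every sample appears in
    exactly one test fold; fold sizes differ by at most one."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# Hypothetical values for illustration.
actual = [3, 5, 7, 9]
predicted = [2, 5, 8, 9]
r2 = r2_score(actual, predicted)    # SS_res = 2, SS_tot = 20
error = rmse(actual, predicted)     # sqrt(2 / 4)
folds = list(k_fold_indices(25, k=10))
```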

Step 6: Deployment

The deployment phase involves implementing the predictive model in a real-world setting. While this phase is crucial for operationalizing the model, it falls outside the scope of this study. However, planning for deployment includes outlining how the model could be integrated into building management systems for real-time energy monitoring and management.

Fault Detection Systems: The deployed energy prediction model could be used to compare the measured building energy consumption against the predicted values and flag discrepancies that point to errors in the HVAC input data. Examples include but are not limited to the following:
- Errors within the rooftop unit (RTU) outdoor airflow;
- Errors within the RTU supply air temperature;
- Errors within the RTU total supply airflow.
Load Shape Analysis: The model can analyze and predict the energy consumption for similar buildings within the area under different conditions, such as summer overheating periods, when buildings tend to consume more power for their HVAC systems. Power-supplying organizations can then incorporate the model to prepare to meet their clients’ needs.
Model Predictive Control: The model can be coupled with a model predictive controller (MPC) trained on the building’s history to learn how the different predictor values affect the target variable (energy). The controller can then act by modifying the HVAC setpoints, such as the airflow values and the temperature setpoints, to lower energy consumption without affecting occupant comfort. Compared to rule-based control (RBC), MPC is an efficient approach that has recently gained interest for enhancing building energy efficiency.

4. Development of the Energy Prediction Framework: A Case Study Approach

4.1. Data Understanding

This research uses a dataset from an office building known as Building 59 or Wang Hall [55], which is situated within the Lawrence Berkeley National Laboratory (Berkeley Lab) campus in Berkeley, California. The region experiences a Mediterranean climate, characterized by mild, wet winters and dry summers, which can significantly influence the building’s energy consumption patterns. The building (Figure 2) encompasses 10,400 m² of conditioned space distributed over four floors. The lower level houses mechanical systems, the second level is occupied by the National Energy Research Scientific Computing Center (NERSC), and the third and fourth levels serve as office spaces. Specifically, the third floor is mainly composed of closed office spaces, while the fourth floor is predominantly an open office space.

The building is segmented into 57 thermal zones distributed across its left and right wings. Zones with exterior walls and windows are categorized as exterior zones, while all others are classified as interior zones. Four RTUs provide HVAC coverage for the entire building, with two RTUs per wing. Temperatures in exterior zones are monitored by wall-mounted sensors installed in each zone served by an under-floor terminal (UFT), which is part of the building automation system (BAS). Interior zone temperatures are measured by 16 desk-level sensors added by the research team, built with Raspberry Pi Zero W and DS18B20 Digital Temperature Sensors. These sensors are placed near occupant workstations. Additionally, camera-based sensors from TRAF-SYS are installed at the six entrances/exits of the building’s southern wing to measure occupant counts. Each camera-based sensor counts the number of people entering and exiting the space, and the number of occupants in the target area is determined from the net flow of people crossing the boundary. Wi-Fi data were gathered with the assistance of Berkeley Lab’s IT department. The total number of connected devices at each Wi-Fi access point (AP) was recorded and then aggregated by floor level based on the location of each AP. HVAC system operational data are retrieved using the ALC SOAP web interface, which collects data from specified points within the ALC WebCTRL Building Automation System. For electrical consumption data, queries are made to the ElasticSearch database, where the data are stored for specific points accessible through a web endpoint. Finally, historical weather data for the building were obtained from SynopticLabs (MesoWest and SynopticLabs 2017), utilizing a weather station mounted on a tower located approximately 300 m northeast of the building on the Berkeley Lab campus. The data collected include outdoor air temperature, dew point, precipitation, pressure, relative humidity, solar irradiation, wind speed, and wind direction.
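The net-flow calculation behind the camera-based occupant counts can be sketched as a running sum of entries minus exits. The function name, the toy counts, and the clipping of negative totals to zero (a guard against accumulated counting errors) are assumptions for illustration, not a description of the TRAF-SYS processing.

```python
def occupancy_from_flows(entries, exits, initial=0):
    """Running occupant count from per-interval entry and exit counts."""
    counts = []
    current = initial
    for entered, left in zip(entries, exits):
        current += entered - left
        current = max(current, 0)  # assumed guard: occupancy cannot be negative
        counts.append(current)
    return counts

# Hypothetical counts summed over all entrances for four intervals.
entries = [5, 3, 0, 0]
exits = [0, 1, 4, 3]
occupancy = occupancy_from_flows(entries, exits)
```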

The building dataset includes the total energy consumption, HVAC system operating parameters, and indoor and outdoor environmental parameters. Data were collected over three years from more than 300 sensors and meters situated on two floors of offices covering 2325 square metres. The dataset includes 21 attributes, covering the entire building, although only the south wing was considered due to incomplete details and attributes in the north wing. The data were recorded at different time resolutions for each attribute from January 2018 to December 2020, as seen in Table 1. Lastly, the datasets were resampled to a 15 min time step to encompass all data. All attributes were subjected to exploratory data analysis (EDA) to understand the data better. EDA included summarizing the data using descriptive statistics, visualizing distributions, detecting patterns, identifying anomalies, and testing initial hypotheses. This thorough analysis helped gain insights into the data structure and relationships, guiding the subsequent data preparation steps.
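Bringing attributes recorded at different resolutions onto a common 15 min time step could look like the sketch below. It averages all readings that fall into each 15 min bin; the choice of the mean as the aggregation rule is an assumption, since the text does not state it, and attributes recorded less often than every 15 min would need interpolation or forward filling instead.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def resample_mean(readings, step_minutes=15):
    """Average (timestamp, value) readings into fixed bins aligned to the hour.
    step_minutes is assumed to divide 60."""
    bins = defaultdict(list)
    for ts, value in readings:
        floored = ts - timedelta(minutes=ts.minute % step_minutes,
                                 seconds=ts.second,
                                 microseconds=ts.microsecond)
        bins[floored].append(value)
    return {ts: sum(vals) / len(vals) for ts, vals in sorted(bins.items())}

# Hypothetical 5 min readings for one attribute.
readings = [
    (datetime(2018, 1, 1, 0, 0), 1.0),
    (datetime(2018, 1, 1, 0, 5), 2.0),
    (datetime(2018, 1, 1, 0, 10), 3.0),
    (datetime(2018, 1, 1, 0, 15), 10.0),
]
resampled = resample_mean(readings)
```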

4.2. Data Preparation

All datasets were thoroughly examined for physical inconsistencies and inaccurate records, which were then removed to ensure the consistency of the recorded data. Understanding the physics behind each attribute and the data recording process was a crucial component of this process. The data were further refined by removing statistical outliers, defined as values falling outside the mean ± three standard deviations for each attribute.
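The three-standard-deviation rule can be expressed compactly. The sketch below uses the population standard deviation, which is an assumption (the text does not say whether the sample or population form was used), and the values are invented for illustration.

```python
import statistics

def remove_three_sigma_outliers(values):
    """Keep only values within mean ± 3 standard deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population form, assumed
    lower, upper = mean - 3 * std, mean + 3 * std
    return [v for v in values if lower <= v <= upper]

# Hypothetical attribute with one implausible spike.
raw = [10.0] * 20 + [100.0]
cleaned = remove_three_sigma_outliers(raw)
```

Because the outlier itself inflates the mean and the standard deviation, this rule only catches extreme values when enough ordinary samples are present.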

The heating and cooling temperature setpoints had a nine-month gap from January to September 2018, while the occupancy counts were recorded from May 2018 to February 2019, leaving only four months of mutual data. A two-step imputation process for the heating and cooling temperature setpoints was implemented to increase the overlap between the occupancy counts and temperature setpoints. As the temperature setpoint values were recorded every 5 min, the first step was to impute values for the four intermediate 1 min intervals only if the previous and next recorded values were identical, which yielded values at 1 min intervals. The second step was to impute all other missing values using the mean value. Additionally, to avoid dependencies among different attributes, only the average heating and cooling temperature setpoints for the different zones of the south wing were used in the machine learning models.
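A minimal sketch of this two-step imputation is given below, with missing 1 min samples represented as None. The function name and toy series are hypothetical, and taking "the mean value" to be the mean of the originally observed samples is an assumption.

```python
def impute_setpoints(series, max_gap=4):
    """Step 1: fill short gaps whose bounding values are identical.
    Step 2: fill every remaining gap with the mean of observed values."""
    filled = list(series)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        j = i
        while j < n and filled[j] is None:  # find the end of this gap
            j += 1
        bounded = i > 0 and j < n
        if bounded and (j - i) <= max_gap and filled[i - 1] == filled[j]:
            for idx in range(i, j):
                filled[idx] = filled[i - 1]
        i = j
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)  # assumed meaning of "the mean value"
    return [mean if v is None else v for v in filled]

# Hypothetical setpoint recorded every 5 min on a 1 min grid.
raw = [70, None, None, None, None, 70, None, None, None, None, 72]
imputed = impute_setpoints(raw)
```

Here the first gap is bounded by identical values and is filled with 70, while the second gap (70 to 72) falls through to the mean.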

Occupancy data were collected from 18 May to 19 February 2018 at 10 min intervals. Wi-Fi connections were available from 1 May to 30 July 2018 and again in the last nine months of 2020. We summed the Wi-Fi and occupancy counts on each floor to obtain the total counts for the whole wing. The data from 2018 were utilized to develop a model for predicting occupancy in the last nine months of 2020. Previous research on the same building and dataset identified a delay between Wi-Fi connections and occupancy [56]. Our initial analysis also confirmed this finding, as illustrated in Figure 3. This lag, noted in several studies, can be attributed to factors such as Wi-Fi network management systems [57], privacy protection features generating randomized MAC addresses, continuous device connections during lunch breaks [58], and the discrepancy between connection numbers during morning arrivals and evening departures when devices remain connected [56].

To address this issue, we shifted the Wi-Fi connection data by one hour to synchronize with occupancy data. Subsequent analyses in this study are based on these adjusted Wi-Fi connections, as demonstrated by the time series of the Wi-Fi and occupancy data shown in Figure 4. To better represent occupancy data and assess the number of Wi-Fi connections associated with occupancy, stationary devices were removed from the Wi-Fi connection data. This was accomplished by identifying the minimum number of Wi-Fi connections per day and subtracting this value from each Wi-Fi data point. Additionally, time was not factored into the occupancy prediction, as the temporal patterns were already captured by the behaviour of the Wi-Fi connections.
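The two adjustments described above can be sketched as follows. The function name and sample counts are hypothetical, and both the direction of the one-hour shift and the choice to compute the daily minimum after shifting are assumptions, since the text does not state them.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def adjust_wifi(counts, shift=timedelta(hours=-1)):
    """Shift Wi-Fi timestamps, then subtract each day's minimum count."""
    shifted = {ts + shift: n for ts, n in counts.items()}
    daily_min = defaultdict(lambda: float("inf"))
    for ts, n in shifted.items():
        daily_min[ts.date()] = min(daily_min[ts.date()], n)
    # The daily minimum is treated as the number of stationary devices.
    return {ts: n - daily_min[ts.date()] for ts, n in shifted.items()}


# Hypothetical hourly counts for one day (illustrative values only).
day = datetime(2018, 5, 21)
counts = {day + timedelta(hours=h): n for h, n in [(3, 12), (9, 40), (13, 55), (20, 15)]}
adjusted = adjust_wifi(counts)
```

After adjustment, the lowest count of each day is zero, so the remaining values approximate connections from devices that come and go with occupants.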

A Density-Based Spatial Clustering of Applications with Noise (DBSCAN) model was applied to the occupancy and Wi-Fi connections dataset before running prediction models to eliminate non-statistical outliers. The DBSCAN algorithm identified 3.22% of the data as noise, which is shown in purple in Figure 5. Subsequently, Artificial Neural Network Regression (ANNR), DTR, and Multiple Linear Regression (MLR) were utilized for occupancy prediction, with detailed performance indicators provided in the Results section.
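To make the noise-removal step concrete, the following dependency-free sketch implements a minimal DBSCAN on (Wi-Fi, occupancy) pairs. The `eps` and `min_samples` values and the sample points are hypothetical, as the study does not report its settings; in practice a library implementation would normally be used.

```python
from math import dist


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN; returns one label per point, with -1 marking noise."""
    # Neighbourhoods include the point itself.
    neighbours = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps] for p in points
    ]
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels


# Hypothetical (Wi-Fi connections, occupancy) pairs with one implausible point.
points = [(10, 8), (11, 9), (12, 9), (11, 8), (12, 10), (60, 2)]
labels = dbscan(points, eps=2.0, min_samples=3)
noise_share = labels.count(-1) / len(labels)
```

Points labelled -1 would be dropped before training the occupancy models.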

PCA, decision tree regression (DTR), and Pearson correlation analysis were employed to perform feature selection to identify the most influential attributes on energy consumption. Pearson correlation analysis measures the linear relationship between two variables. PCA is a linear transformation technique that reduces the dimensionality of high-dimensional data by projecting it onto a lower-dimensional subspace while preserving most of the original variance. This technique is beneficial when dealing with many features that may be redundant or correlated. We identified features with the highest contribution to the selected principal components through PCA, achieving a cumulative variance of 90%. On the other hand, DTRs are non-linear models that partition the feature space into a set of disjoint regions based on a set of rules. They can be used to identify the most informative features in a dataset by recursively splitting the data into subsets based on the feature values. The decision tree algorithm selects the most discriminative feature at each split point, resulting in a tree structure that can be used to predict the target variable. Once the initial sets of important features were identified through PCA and DTR, an iterative feature selection process was implemented to refine these features further. This process involved the following steps:

Initial Selection:
- Features identified by PCA and DTR were combined to form an initial set of candidate features.
Model Training and Evaluation:
- Using the initial set of features, a baseline model was trained. Performance metrics such as R² and RMSE were calculated to evaluate the model’s accuracy.
Iterative Refinement:
- In each iteration, the subsets of features were systematically removed or added based on their contribution to the model’s performance. The model was retrained with each new subset of features. Features that did not contribute significantly to improving R² or reducing RMSE were eliminated. This iterative process continued until no significant improvements in model performance were observed, ensuring the most valuable and non-redundant features were retained.
Validation with Pearson Correlation Analysis:
- Pearson correlation analysis was conducted on the selected features to ensure they were not collinear. This step was crucial to eliminate multicollinearity, which could impair model performance and interpretability.
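The final collinearity check in the steps above can be sketched as follows. The feature names, the sample values, and the 0.8 threshold are hypothetical; the study does not state the cut-off it used.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def collinear_pairs(features, threshold=0.8):
    """Return (name_a, name_b, r) for pairs whose |r| exceeds the threshold."""
    flagged = []
    for a, b in combinations(features, 2):
        r = pearson(features[a], features[b])
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged


# Hypothetical feature columns (illustrative values only).
features = {
    "outdoor_temp": [10, 12, 15, 18, 20],
    "supply_temp": [20, 24, 30, 36, 40],  # exactly twice outdoor_temp
    "occupancy": [5, 30, 10, 25, 8],
}
flagged = collinear_pairs(features)
```

For each flagged pair, one of the two features would be a candidate for removal, since both carry nearly the same linear information.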

4.3. Building Energy Modelling

The third objective of this study was to evaluate various data-driven models for predicting building energy consumption across different predictor values. Six machine learning models were selected based on their proven effectiveness in previous building energy prediction research [59].

MLR: Establishes a linear relationship between predictors and the target value by fitting a linear equation to observed data.
Polynomial Regression (PR): Extends MLR by employing polynomial equations of varying degrees to capture non-linearity in the data. This approach is useful when the relationship between the predictors and the target is non-linear but can be approximated by a polynomial.
DTR: Uses a tree-based approach where the data are recursively split into subsets based on the feature and threshold that most reduce the prediction error at each node. For regression, the splitting criterion is typically based on metrics such as the mean squared error.
Random Forest (RF): An ensemble method that combines multiple decision trees to improve predictive performance and robustness. The final prediction is typically the average (for regression) of the predictions from all individual trees.
Support Vector Regression (SVR): This method fits a function that keeps the data points within a specified tolerance margin around it wherever possible, penalizing only the prediction errors that exceed this tolerance.
ANNR: Consists of interconnected neurons organized in layers, where each neuron processes inputs and delivers outputs based on an activation function.
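To illustrate how a tree ensemble averages individual predictions, the sketch below bags depth-1 regression trees (stumps) on a single predictor. It is a deliberately simplified stand-in for Random Forest: it omits random feature subsampling and deeper trees, and all names and data are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree by minimizing the summed squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, ml, mr)
    if best is None:  # all x values identical: predict the mean everywhere
        m = mean(ys)
        return lambda x: m
    _, threshold, ml, mr = best
    return lambda x: ml if x < threshold else mr


def fit_bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Train stumps on bootstrap samples and average their predictions."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]  # bootstrap sample
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(tree(x) for tree in trees)


# Hypothetical step-shaped relationship (illustrative values only).
xs = list(range(10))
ys = [10.0] * 5 + [30.0] * 5
predict = fit_bagged_stumps(xs, ys)
```

Each stump sees a slightly different sample, so individual split points vary; averaging them smooths the prediction, which is the same mechanism that gives Random Forest its robustness.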

These six models were applied to two distinct periods in the dataset. The first period, before occupancy prediction, contained 10,273 data points from September 2018 to February 2019. The second dataset combined predicted occupancy data with the previous dataset, yielding 30,645 data points from September 2018 to December 2020. Notably, nine months of 2019 were excluded due to the absence of Wi-Fi and occupancy values.

5. Results and Discussion

This section presents the results for feature selection and occupancy prediction, followed by the energy prediction results for the two datasets used with the ML models (before and after the occupancy prediction).

5.1. Feature Selection

Combining PCA and DTR successfully achieved the first research objective: identifying the most informative and non-redundant features for energy modelling and prediction tasks. This approach led to improved model performance, reduced computation time, and better interpretability of the results. Guided by the combined set of features, an iterative approach was taken to determine the combination of attributes that would yield the highest R² and the lowest RMSE. The iterative process involved systematically evaluating the contribution of each feature by training multiple models with different feature subsets and comparing their performance metrics. Features that did not contribute to significant improvements in R² or reductions in RMSE were gradually removed. This iterative refinement continued until the optimal combination of features was identified. Finally, a Pearson correlation analysis was conducted on the selected attributes to ensure that the resulting combined attributes were not collinear, as illustrated in Figure 6. Darker cells in the matrix indicate correlation values close to 1, while lighter cells represent lower correlation values. This comprehensive approach ensured that the final set of features used in the predictive models was both highly informative and non-redundant, leading to more accurate and efficient energy consumption predictions.

5.2. Occupancy Prediction

After removing 3.22% noise from the dataset using DBSCAN, three machine learning models—ANNR, MLR, and DTR—were deployed for occupancy prediction to achieve the second research objective. Table 2 shows the average performance indicators of the training model for occupancy based on the Wi-Fi connection data for 2018. For the occupancy prediction in 2020, the ANNR demonstrated the best accuracy.

Although the prediction model shows good results, it should be mentioned that the training model was developed during normal conditions and used during the COVID period. In 2020, a global lockdown was implemented to curb the spread of COVID-19 in many countries. The impact of the pandemic lockdown on building energy use was complex, varying by building type, climate, and operational policies. Restrictions on occupant activities generally led to reduced energy consumption in office buildings, particularly for electric devices such as lighting and plug loads. However, the lingering effects of the lockdown posed challenges to improving building energy efficiency by introducing uncertainty and additional measures to prevent virus transmission. The COVID-19 pandemic has introduced several limitations to our study. The irregular occupancy patterns during lockdowns and remote work periods do not accurately reflect typical usage, which may bias model training and evaluation. Furthermore, various Wi-Fi connection behaviours can significantly impact the performance and applicability of the model. Different connection durations, such as long-term versus short-term connections, can influence the accuracy of occupancy predictions. Long-term connections may provide a stable indicator of occupancy, while short-term connections might introduce noise due to transient visitors or sporadic device usage [60]. Additionally, it is important to note that the occupancy data collected primarily from Wi-Fi connections does not account for occupants who do not connect their devices to the Wi-Fi network [61]. The density of Wi-Fi access points and the spatial distribution of users within a building can also affect model accuracy. High-density areas with frequent Wi-Fi handoffs may result in more complex data patterns that the model needs to interpret accurately [62].

5.3. Building Energy Prediction

The six previously discussed machine learning models were evaluated to accomplish the third research objective, and their results were compared. The models were tested using two distinct datasets: one before and one after incorporating the occupancy predictions.

5.3.1. Energy Prediction with Actual Occupancy

The initial dataset used with the models was created by merging attributes without incorporating occupancy predictions. The R² and RMSE values for this dataset are presented in Table 3. Additionally, the residual plot for the RF model is shown in Figure 7. The results indicate that the RF and SVR models performed well regarding R² values. However, the SVR model exhibited issues with its residuals, which did not follow a normal distribution and had a mean value close to 24, indicating poor performance. In contrast, the RF model demonstrated superior prediction performance, making it the best model among those tested.
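The residual check mentioned above can be sketched as follows: residuals centred on zero suggest an unbiased model, whereas a mean far from zero indicates a systematic offset even when R² is high. The function name and the values are hypothetical and are not taken from the study's results.

```python
from statistics import mean, stdev


def residual_summary(actual, predicted):
    """Return the mean and sample standard deviation of the residuals."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return mean(residuals), stdev(residuals)


# Hypothetical energy values (illustrative only).
actual = [50.0, 60.0, 70.0, 80.0]
unbiased_pred = [52.0, 58.0, 71.0, 79.0]  # errors scatter around zero
biased_pred = [40.0, 50.0, 60.0, 70.0]    # consistently 10 units too low

unbiased_mean, _ = residual_summary(actual, unbiased_pred)
biased_mean, _ = residual_summary(actual, biased_pred)
```

The biased predictions track the shape of the actual series perfectly, which is why a correlation-based score alone can hide this kind of offset.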

5.3.2. Energy Prediction with Predicted Occupancy

The second dataset used with the models was created by merging the attributes after predicting occupancy for 2020 and incorporating these predictions into the dataset. The computation time was recorded to reflect the duration required for processing a sufficiently large dataset for this application. The R² and RMSE values for this dataset are presented in Table 4, and the residual plot for the RF model is shown in Figure 8 below. The results indicate that the RF model provided the highest R² value and the lowest RMSE, signifying better model performance. Although the SVR model produced similar results in terms of R², its residuals were not normally distributed, and their mean value was nearly 26, indicating poor performance for this problem. While the RF model achieved the best results, it required more computation time than the other models, taking over 2 h to complete. This longer computation time has several important implications. Firstly, it may require more computational resources, increasing operational costs, particularly if the model needs to be run frequently or in real-time applications. Secondly, the need for frequent updates or retraining to adapt to new occupancy patterns or environmental conditions may be limited by these computational demands, reducing the model’s adaptability. In contrast, the DTR model demonstrated the best performance in computation time and residuals’ distribution, making it a more efficient choice for scenarios where computational resources and time are limited. Balancing performance with computational efficiency is crucial. The improved accuracy of the RF model can lead to more precise energy management, which may justify the additional computational effort in many cases. However, the DTR model may be more practical for applications requiring real-time predictions or where computational resources are constrained.

5.3.3. Error Propagation

The final objective is to examine the influence of predicted occupancy data on the accuracy of energy consumption predictions. After incorporating the predicted occupancy data from the second dataset, which provides additional data points for training the energy prediction models, it is essential to investigate the error propagation using these predicted results. Error propagation refers to the impact that inaccuracies in the predicted occupancy data can have on the subsequent predictions of energy consumption. To account for this, several steps have been followed:

Data Extraction: The occupancy and Wi-Fi data from May to July 2018 were extracted.
Occupancy Prediction: The occupancy was predicted based on Wi-Fi connections using a DTR model.
Dataset Merging: The predicted and actual occupancy datasets were merged with the energy predictors’ datasets, resulting in a relatively small dataset.
Model Training and Testing: Two different Random Forest energy prediction models were trained and tested using the predicted and actual occupancies.
Performance Comparison: The R² and RMSE values for the energy prediction models with actual and predicted occupancy were compared. The energy prediction model with predicted occupancy had an R² of 0.667 and an RMSE of 8.375, while the model with actual occupancy had an R² of 0.664 and an RMSE of 8.437. The close values of R² and RMSE for the predicted and actual occupancy models suggest that the error introduced using predicted occupancy data is minimal. This indicates that Wi-Fi-based occupancy prediction is a viable method for improving the data inputs for building energy models, leading to more accurate energy consumption predictions.
Data Analysis: The predicted and actual energy consumption versus occupant count and outdoor air temperature are represented in Figure 9. This figure visually represents the relationship between energy consumption, occupant count, and outdoor air temperature. The graph illustrates how energy consumption varies with changes in the number of occupants and how it correlates with outdoor air temperature. The alignment of predicted energy consumption with actual consumption across different occupant counts and varying outdoor temperatures indicates the model’s accuracy in capturing these dynamics. This comprehensive view helps validate the model’s effectiveness in predicting energy consumption based on occupancy and environmental factors.
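The performance comparison in the steps above relies on two metrics, which can be computed as in the sketch below. The function names and values are hypothetical; the two prediction lists merely stand in for models trained with actual and with predicted occupancy.

```python
from math import sqrt


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical energy values (illustrative only).
actual = [40.0, 50.0, 60.0, 70.0]
pred_actual_occ = [42.0, 48.0, 61.0, 69.0]     # model fed measured occupancy
pred_predicted_occ = [43.0, 48.0, 61.0, 69.0]  # model fed predicted occupancy

delta_r2 = r2_score(actual, pred_actual_occ) - r2_score(actual, pred_predicted_occ)
delta_rmse = rmse(actual, pred_predicted_occ) - rmse(actual, pred_actual_occ)
```

Small values of `delta_r2` and `delta_rmse` would indicate that little error propagates from the occupancy model into the energy model.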

5.4. Implications of Using Predicted Occupancy Data for Energy Prediction Modelling

Using occupancy data predicted from Wi-Fi connections offers several advantages. Firstly, it provides a more accurate representation of actual building usage, which directly influences energy consumption patterns. Filling the gaps where motion sensor data are unavailable ensures a continuous and consistent dataset, leading to more accurate energy consumption predictions. Accurate occupancy data are crucial because they directly affect HVAC usage, lighting, and other energy-intensive systems. Secondly, predicted occupancy data improve the responsiveness of energy management systems: building management systems can proactively adjust HVAC settings, lighting, and other energy-consuming systems to match the expected demand. For instance, a prediction of higher occupancy can trigger a pre-emptive adjustment of HVAC settings. This reduces energy waste while maintaining occupant comfort, leading to better overall building performance. Additionally, the method is flexible and can be applied across various building types without the need for extensive sensor installations.

While employing predicted occupancy data has clear benefits, it also introduces potential error propagation, which can affect energy prediction accuracy. Errors in occupancy predictions can arise from various sources, such as inaccuracies in the Wi-Fi data, changes in occupancy patterns, or environmental factors affecting Wi-Fi signals. When these errors are introduced into the occupancy data, they can propagate through the energy prediction model, leading to inaccuracies in the final energy consumption estimates. For example, overestimating occupancy levels could result in the unnecessary activation of HVAC systems, increasing energy consumption. Conversely, underestimating occupancy could lead to insufficient heating or cooling, compromising occupant comfort and potentially leading to higher energy usage as the system compensates. However, the error propagation analysis in Section 5.3.3 indicated that little error was transferred between the models: the R² and RMSE values obtained with predicted and actual occupancy were nearly identical. This supports Wi-Fi-based occupancy prediction as a viable method for improving the data inputs for building energy models.

6. Conclusions

Building energy modelling (BEM) tools are essential for assessing and improving the energy performance of buildings and evaluating various energy-saving measures. These tools offer significant and cost-effective opportunities for energy conservation, which can substantially mitigate the rise in global energy consumption and its adverse environmental impacts. However, traditional physics-based BEM tools are often time-consuming, require significant user expertise, and involve complex assumptions. The proliferation of building-related data has enabled the application of machine learning (ML) approaches for predicting and optimizing building energy consumption. In this case study, multiple ML models were employed to predict energy use using measured data from an office building in Berkeley, California. Due to missing values in occupancy counts, which are critical features in our models, we predicted these counts using Wi-Fi data. For the occupancy prediction, the ANN model delivered the best results with an R² of 0.92 and an RMSE of 3.78. These findings demonstrate that Wi-Fi data can be used with acceptable accuracy to predict occupancy counts, which can, in turn, be utilized to analyze occupant comfort and enhance the accuracy of BEMs. For the energy prediction models, the Random Forest model outperformed others, achieving an R² of 0.85 and an RMSE of 4.29 when using the predicted occupancy count. Additionally, an error propagation method was performed to measure the error from the occupancy prediction model on the energy prediction model. The results indicated that Wi-Fi-based occupancy prediction is viable for improving the data inputs for building energy models, leading to more accurate energy consumption predictions.

However, the study has several limitations. The limited period during which both measured occupancy data and Wi-Fi data were available restricted our analysis to roughly three months of data for predicting one year. Furthermore, the absence of measured occupancy data during the COVID-19 pandemic limits the model’s applicability to normal operating conditions. Moreover, various Wi-Fi connection behaviours can significantly impact the model’s performance and applicability. Long-term connections provide stable occupancy indicators, while short-term connections may introduce noise from transient visitors. The occupancy data from Wi-Fi do not account for non-connected occupants, and the density of Wi-Fi access points and spatial user distribution can also affect model accuracy. High-density areas with frequent handoffs create complex data patterns that the model must accurately interpret.

To overcome the aforementioned limitations, future studies should aim to collect longitudinal data encompassing both pandemic and post-pandemic conditions. Incorporating scenario analysis to model different occupancy patterns, including pandemic-like conditions, typical pre-pandemic occupancy, and potential future scenarios, can help develop more adaptable prediction models. Additionally, combining multiple data sources, such as Wi-Fi data, CO₂ levels, and PIR sensor data, can provide a more holistic view of occupancy patterns. Moreover, future models should incorporate mechanisms to differentiate between stable and transient connections and consider the spatial configuration of access points. Cross-validation with data from multiple building types with varying Wi-Fi infrastructures will enhance the generalizability and robustness of the occupancy prediction models. Furthermore, exploring more advanced machine learning models, such as deep learning and time series models, could provide comparative insights into their efficiency relative to the models deployed in this study. Additionally, utilizing larger datasets with more data points and fewer missing values will help develop a more comprehensive and inclusive model. By addressing these limitations and exploring these future research directions, we can further enhance the effectiveness and applicability of machine learning models in predicting and optimizing building energy consumption, contributing to more sustainable energy management practices in the built environment.

Author Contributions

Conceptualization, M.E. and A.A.-S.; methodology, M.E. and G.A.; formal analysis, M.E. and G.A.; data curation, M.E., A.A.-S., E.M.A. and G.A.; investigation, M.E., A.A.-S. and G.A.; resources, M.E., G.A., E.M.A. and A.A.-S.; writing—original draft preparation, M.E., A.A.-S., G.A. and E.M.A.; writing—review and editing, M.E., G.A., E.M.A. and A.A.-S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Some or all data, models, or code that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

Researchers Supporting Project number (RSPD2024R899), King Saud University, Riyadh, Saudi Arabia.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Key World Energy Statistics. 2007. Available online: http://eng.sut.ac.th/transportenergy/data/paper4web/aborad%20report/indicator/Key_Stats_2007%20-%20Australia.pdf (accessed on 19 August 2024).
Cao, X.; Dai, X.; Liu, J. Building energy-consumption status worldwide and the state-of-the-art technologies for zero-energy buildings during the past decade. Energy Build. 2016, 128, 198–213.
US EIA. How Much Energy Is Consumed in the World by Each Sector; US Energy Information Administration: Washington, DC, USA, 2014.
Saudi Energy Efficiency Center (SEEC). Saudi Energy Efficiency Program (SEEP). 2014. Available online: https://www.seec.gov.sa/en (accessed on 15 May 2024).
Eurostat. Renewable Energy Statistics. 2023. Available online: https://ec.europa.eu/eurostat/statistics-explained/index.php/Renewable_energy_statistics (accessed on 26 June 2024).
Pedersen, T.H.; Hedegaard, R.E.; Petersen, S. Space heating demand response potential of retrofitted residential apartment blocks. Energy Build. 2017, 141, 158–166.
Li, D.; Hu, G.; Spanos, C.J. A data-driven strategy for detection and diagnosis of building chiller faults using linear discriminant analysis. Energy Build. 2016, 128, 519–529.
Zhao, H.; Magoulès, F. A review on the prediction of building energy consumption. Renew. Sustain. Energy Rev. 2012, 16, 3586–3592.
Heo, Y.; Zavala, V.M. Gaussian process modeling for measurement and verification of building energy savings. Energy Build. 2012, 53, 7–18.
Wang, Z.; Srinivasan, R.; Wang, Y. Homogeneous Ensemble Model for Building Energy Prediction: A Case Study Using Ensemble Regression Tree. 2016. Available online: https://www.aceee.org/files/proceedings/2016/data/papers/12_189.pdf (accessed on 15 May 2024).
Reeves, T.; Olbina, S.; Issa, R. Validation of building energy modeling tools: Ecotect™, Green Building Studio™ and IES™. In Proceedings of the 2012 Winter Simulation Conference (WSC), Berlin, Germany, 9–12 December 2012; pp. 1–12.
Ryan, E.M.; Sanquist, T.F. Validation of building energy modeling tools under idealized and realistic conditions. Energy Build. 2012, 47, 375–382.
Aydinalp, M.; Ismet Ugursal, V.; Fung, A.S. Modeling of the appliance, lighting, and space-cooling energy consumptions in the residential sector using neural networks. Appl. Energy 2002, 71, 87–110.
Yalcintas, M.; Aytun Ozturk, U. An energy benchmarking model based on artificial neural network method utilizing US Commercial Buildings Energy Consumption Survey (CBECS) database. Int. J. Energy Res. 2007, 31, 412–421.
Yu, Z.; Haghighat, F.; Fung, B.C.M.; Yoshino, H. A decision tree method for building energy demand modeling. Energy Build. 2010, 42, 1637–1646.
Ekici, B.B.; Aksoy, U.T. Prediction of building energy consumption by using artificial neural networks. Adv. Eng. Softw. 2009, 40, 356–362.
Moon, J.; Park, S.; Rho, S.; Hwang, E. A comparative analysis of artificial neural network architectures for building energy consumption forecasting. Int. J. Distrib. Sens. Netw. 2019, 15, 1–19.
Lee, S.; Jung, S.; Lee, J. Prediction model based on an artificial neural network for user-based building energy consumption in South Korea. Energies 2019, 12, 608.
Burkhart, M.C.; Heo, Y.; Zavala, V.M. Measurement and verification of building systems under uncertain data: A Gaussian process modeling approach. Energy Build. 2014, 75, 189–198.
Yoon, Y.R.; Moon, H.J. Energy consumption model with energy use factors of tenants in commercial buildings using Gaussian process regression. Energy Build. 2018, 168, 215–224.
Hong, G.; Choi, G.S.; Eum, J.Y.; Lee, H.S.; Kim, D.D. The hourly energy consumption prediction by KNN for buildings in community buildings. Buildings 2022, 12, 1636.
Dong, B.; Cao, C.; Lee, S.E. Applying support vector machines to predict building energy consumption in tropical region. Energy Build. 2005, 37, 545–553.
Liu, Y.; Chen, H.; Zhang, L.; Wu, X.; Wang, X.J. Energy consumption prediction and diagnosis of public buildings based on support vector machine learning: A case study in China. J. Clean. Prod. 2020, 272, 122542.
Ma, Z.; Ye, C.; Li, H.; Ma, W. Applying support vector machines to predict building energy consumption in China. Energy Procedia 2018, 152, 780–786.
Guven, D.; Kayalica, M.O. Analysing the determinants of the Turkish household electricity consumption using gradient boosting regression tree. Energy Sustain. Dev. 2023, 77, 101312.
Ghenai, C.; Al-Mufti, O.A.A.; Al-Isawi, O.A.M.; Amirah, L.H.L.; Merabet, A. Short-term building electrical load forecasting using adaptive neuro-fuzzy inference system (ANFIS). J. Build. Eng. 2022, 52, 104323.
Durand, D.; Aguilar, J.; R-Moreno, M.D. An analysis of the energy consumption forecasting problem in smart buildings using LSTM. Sustainability 2022, 14, 13358.
Kim, D.; Lee, Y.; Chin, K.; Mago, P.J.; Cho, H.; Zhang, J. Implementation of a long short-term memory transfer learning (LSTM-TL)-based data-driven model for building energy demand forecasting. Sustainability 2023, 15, 2340.
Ding, Y.; Zhang, Q.; Yuan, T.; Yang, F. Effect of input variables on cooling load prediction accuracy of an office building. Appl. Therm. Eng. 2018, 128, 225–234.
Ma, J.; Cheng, J.C. Identifying the influential features on the regional energy use intensity of residential buildings based on Random Forests. Appl. Energy 2016, 183, 193–201.
Ding, Y.; Zhang, Q.; Yuan, T.; Yang, K. Model input selection for building heating load prediction: A case study for an office building in Tianjin. Energy Build. 2018, 159, 254–270.
Gunay, B.; Shen, W.; Newsham, G. Inverse blackbox modeling of the heating and cooling load in office buildings. Energy Build. 2017, 142, 200–210.
Sholahudin, S.; Han, H. Simplified dynamic neural network model to predict heating load of a building using Taguchi method. Energy 2016, 115, 1672–1678.
Kapetanakis, D.-S.; Mangina, E.; Finn, D.P. Input variable selection for thermal load predictive models of commercial buildings. Energy Build. 2017, 137, 13–26.
Cui, X.; Lee, M.; Koo, C.; Hong, T. Energy consumption prediction and household feature analysis for different residential building types using machine learning and SHAP: Toward energy-efficient buildings. Energy Build. 2024, 309, 113997.
Fan, C.; Xiao, F.; Zhao, Y. A short-term building cooling load prediction method using deep learning algorithms. Appl. Energy 2017, 195, 222–233.
Huang, Q.; Mao, C. Occupancy Estimation in Smart Building Using Hybrid CO₂/Light Wireless Sensor Network. J. Appl. Sci. Arts 2016, 1, 5. Available online: https://opensiuc.lib.siu.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=1011&context=jasa (accessed on 6 May 2024).
Tien, P.W.; Wei, S.; Calautit, J. A Computer Vision-Based Occupancy and Equipment Usage Detection Approach for Reducing Building Energy Demand. Energies 2021, 14, 156.
Jiang, J.; Wang, C.; Roth, T.; Nguyen, C.; Kamongi, P.; Lee, H.; Liu, Y. Residential House Occupancy Detection: Trust-Based Scheme Using Economic and Privacy-Aware Sensors. IEEE Internet Things J. 2022, 9, 1938–1950.
Mashuk, M.S.; Pinchin, J.; Siebers, P.-O.; Moore, T. A smart phone based multi-floor indoor positioning system for occupancy detection. In Proceedings of the 2018 IEEE/ION Position, Location and Navigation Symposium (PLANS), Monterey, CA, USA, 23–26 April 2018; pp. 216–227.
Gaonkar, P.; Bapat, J.; Das, D.; Rao, S.V. Occupancy Estimation in Semi-Public Spaces using Sensor Fusion and Context Awareness. In Proceedings of the 2019 IEEE Region 10 Symposium (TENSYMP), Kolkata, India, 7–9 June 2019; pp. 131–136.
Fisk, W.J. A Pilot Study of the Accuracy of CO2 Sensors in Commercial Buildings. 2008. Available online: https://escholarship.org/uc/item/78t0t90v (accessed on 21 May 2024).
Garg, V.; Bansal, N.K. Smart occupancy sensors to reduce energy consumption. Energy Build. 2000, 32, 81–87.
Jin, Y.; Yan, D.; Sun, H. Lighting System Control in Office Building Using Occupancy Prediction Based on Historical Occupied Ratio. IOP Conf. Ser. Earth Environ. Sci. 2019, 238, 012009.
Vafeiadis, T.; Zikos, S.; Stavropoulos, G.; Ioannidis, D.; Krinidis, S.; Tzovaras, D.; Moustakas, K. Machine Learning Based Occupancy Detection via the Use of Smart Meters. In Proceedings of the 2017 International Symposium on Computer Science and Intelligent Controls (ISCSIC), Budapest, Hungary, 20–22 October 2017; pp. 6–12.
Akkaya, K.; Guvenc, I.; Aygun, R.; Pala, N.; Kadri, A. IoT-based occupancy monitoring techniques for energy-efficient smart buildings. In Proceedings of the 2015 IEEE Wireless Communications and Networking Conference Workshops (WCNCW), New Orleans, LA, USA, 9–12 March 2015; pp. 58–63.
Ouf, M.M.; Issa, M.H.; Azzouz, A.; Sadick, A.-M. Effectiveness of using WiFi technologies to detect and predict building occupancy. Sustain. Build. 2017, 2, 7.
Eweda, A.; Al-Sakkaf, A.; Zayed, T.; Alkass, S. Condition assessment model of building indoor environment: A case study on educational buildings. Int. J. Build. Pathol. Adapt. 2021, 41, 767–788.
Melfi, R.; Rosenblum, B.; Nordman, B.; Christensen, K. Measuring building occupancy using existing network infrastructure. In Proceedings of the 2011 International Green Computing Conference (IGCC), Orlando, FL, USA, 25–28 July 2011; pp. 1–8. [Google Scholar] [CrossRef]
Martani, C.; Lee, D.; Robinson, P.; Britter, R.; Ratti, C. ENERNET: Studying the dynamic relationship between building occupancy and energy consumption. Energy Build. 2012, 47, 584–591. [Google Scholar] [CrossRef]
Zou, H.; Zhou, Y.; Jiang, H.; Chien, S.-C.; Xie, L.; Spanos, C.J. WinLight: A WiFi-based occupancy-driven lighting control system for smart building. Energy Build. 2018, 158, 924–938. [Google Scholar] [CrossRef]
Balaji, B.; Xu, J.; Nwokafor, A.; Gupta, R.; Agarwal, Y. Sentinel: Occupancy based HVAC actuation using existing WiFi infrastructure within commercial buildings. In Proceedings of the SenSys ’13: The 11th ACM Conference on Embedded Network Sensor Systems, Roma, Italy, 11–15 November 2013; pp. 1–14. [Google Scholar] [CrossRef]
Wang, Z.; Hong, T.; Piette, M.A.; Pritoni, M. Inferring occupant counts from Wi-Fi data in buildings through machine learning. Build. Environ. 2019, 158, 281–294. [Google Scholar] [CrossRef]
Martínez-Plumed, F.; Contreras-Ochando, L.; Ferri, C.; Hernández-Orallo, J.; Kull, M.; Lachiche, N.; Ramirez-Quintana, M.J.; Flach, P. CRISP-DM twenty years later: From data mining processes to data science trajectories. IEEE Trans. Knowl. Data Eng. 2019, 33, 3048–3061. [Google Scholar] [CrossRef]
Luo, N.; Wang, Z.; Blum, D.; Weyandt, C.; Bourassa, N.; Piette, M.A.; Hong, T.A. Three-Year Dataset Supporting Research on Building Energy Management and Occupancy Analytics. Sci. Data 2022, 9, 156. [Google Scholar] [CrossRef] [PubMed]
Wang, W.; Hong, T.; Li, N.; Wang, R.Q.; Chen, J. Linking energy-cyber-physical systems with occupancy prediction and interpretation through WiFi probe-based ensemble classification. Appl. Energy 2019, 236, 55–69. [Google Scholar] [CrossRef]
Alishahi, N.; Ouf, M.M.; Nik-Bakht, M. Using WiFi connection counts and camera-based occupancy counts to estimate and predict building occupancy. Energy Buildings 2022, 257, 111759. [Google Scholar] [CrossRef]
Alishahi, N.; Nik-Bakht, M.; Ouf, M.M. A framework to identify key occupancy indicators for optimizing building operation using WiFi connection count data. Build. Environ. 2021, 200, 107936. [Google Scholar] [CrossRef]
Sun, Y.; Haghighat, F.; Fung, B.C.M. A review of the state-of-the-art in data-driven approaches for building energy prediction. Energy Build. 2020, 221, 110022. [Google Scholar] [CrossRef]
Wang, W.; Chen, J.; Song, X. Modeling and predicting occupancy profile in office space with a Wi-Fi probe-based Dynamic Markov Time-Window Inference approach. Build. Environ. 2017, 125, 110–122. [Google Scholar] [CrossRef]
Hou, H.; Pawlak, J.; Sivakumar, A.; Howard, B. An approach for building occupancy modelling considering the urban context. Build. Environ. 2020, 180, 106991. [Google Scholar] [CrossRef]
Huang, W.; Lin, Y.; Lin, B.; Zhao, L. Modeling and predicting the occupancy in a China hub airport terminal using Wi-Fi data. Energy Build. 2019, 199, 306–317. [Google Scholar] [CrossRef]

Figure 1. Proposed CRISP-DM framework.

Figure 2. The office building in Berkeley, California.

Figure 3. 2018 Wi-Fi connections (orange) and occupancy count (blue).

Figure 4. Adjusted Wi-Fi connections and occupancy data (blue line: total_wifi, orange line: total_occ).

Figure 5. DBSCAN results.

Figure 6. Pearson correlation analysis matrix.

Figure 7. Residual plot before incorporating occupancy predictions.

Figure 8. Residual plot following the incorporation of occupancy predictions.

Figure 9. Prediction vs. actual energy demand in 2020.

Table 1. Dataset attributes, availability periods, and time steps.

Attribute	Description	Available Period (two-digit year, then month)	Time Step (min)
Avg. merged int. temp.	Indoor air temp.	18 February–20 December	10
Water supply temp.	Heat pump heating water supply temperature	18 January–20 December	1
Fan speed	Supply air fan speed	18 January–20 December	1
Return air temp.	Roof Top Unit return air temp.	18 January–20 December	1
Air pressure SP	Roof Top Unit air pressure static setpoint	18 January–20 December	1
Supply air temp. SP	Roof Top Unit supply air temp. setpoint	18 January–20 December	1
Supply fan speed	Roof Top Unit supply fan speed	18 January–20 December	1
Mixed air temp.	Roof Top Unit mixed air temp.	18 January–20 December	1
Outdoor air temp.	Roof Top Unit outdoor air temp.	18 January–20 December	1
Return fan speed	Roof Top Unit return fan speed	18 January–20 December	1
Total energy consumption	Total electricity loads (miscellaneous, lighting, and HVAC)	18 January–20 December	15
Occupancy	Total occupant counts	18 May–19 February	1
Cooling SP	Cooling temp. setpoint	18 September–20 December	5
Heating SP	Heating temp. setpoint	18 September–20 December	5
RTU supply air temp.	Roof Top Unit supply air temp.	18 January–20 December	1
RTU filtered supply air flow rate	Roof Top Unit filtered supply air flow rate	18 January–20 December	1
RTU outdoor air flow rate	Roof Top Unit outdoor air flow rate	18 January–20 December	1
Air temp.	Outdoor air temp.	18 January–20 December	15
Relative humidity	Outdoor air relative humidity	18 January–20 December	15
Solar radiation	Outdoor solar radiation	18 January–20 December	15
Wi-Fi connection	Total Wi-Fi connection counts	May–18 July February–20 December	5
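
The attributes in Table 1 are logged at different time steps (1, 5, 10, and 15 min), so they would need to be brought onto a common interval before being joined with the 15 min energy readings. The following is a minimal sketch of one way to do that, assuming mean aggregation into fixed 15 min bins; the function name `resample_mean` and the sample fan-speed readings are hypothetical, and the aggregation rule is an assumption rather than the procedure used in this study.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def resample_mean(readings, step_min=15):
    """Average (timestamp, value) pairs into bins of step_min minutes.

    Assumes step_min divides 60, so flooring the minute field is enough
    to find the start of each bin.
    """
    bins = defaultdict(list)
    for ts, value in readings:
        # Floor the timestamp to the start of its bin.
        bin_start = ts - timedelta(
            minutes=ts.minute % step_min,
            seconds=ts.second,
            microseconds=ts.microsecond,
        )
        bins[bin_start].append(value)
    # Return bins in chronological order with their mean value.
    return {start: sum(vals) / len(vals) for start, vals in sorted(bins.items())}


# Hypothetical 1 min fan-speed readings covering 30 minutes.
start = datetime(2018, 1, 1, 0, 0)
fan_speed = [(start + timedelta(minutes=i), float(i)) for i in range(30)]
fan_15min = resample_mean(fan_speed)
```

Mean aggregation suits continuous signals such as temperatures and fan speeds; count-like attributes such as occupancy or Wi-Fi connections might instead call for a maximum or a last-value rule.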

Table 2. Average performance indicators after a 10-fold cross-validation.

Model	RMSE	R²
DTR	9.53	0.790
MLR	9.54	0.789
ANNR	9.43	0.794

Table 3. Performance indicators before the occupancy prediction.

Model	RMSE	R² Split Validation	R² Cross-Validation	R² Grid Search
RF	4.29	0.829	0.83	0.85
SVR	4.33	0.825	0.83	0.85
DTR	5.84	0.68	0.69	0.77
MLR	7.34	0.5	0.5	0.5
PR	4.56	0.8	0.51	0.8
ANNR	4.22	0.836	0.83	-

Table 4. Performance indicators after the occupancy prediction.

Model	RMSE	R² Split Validation	R² Cross-Validation	R² Grid Search	Computation Time
RF	3.78	0.897	0.9	0.92	2 h 14 min
SVR	4.13	0.842	0.85	0.85	10 min 23 s
DTR	5.85	0.683	0.69	0.77	9 s
MLR	7.34	0.5	0.5	0.5	6 s
PR	4.56	0.807	0.51	0.79	1 min 53 s
ANNR	3.97	0.887	0.894	-	21 min 25 s
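
The RMSE and R² values reported in Tables 2–4 are averages over cross-validation folds. The sketch below illustrates how such fold-averaged indicators could be computed, assuming a shuffled 10-fold split. A single-feature least-squares line and synthetic data stand in for the models and dataset of this study, and the names `fit_line` and `k_fold_scores` are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (single feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def k_fold_scores(xs, ys, k=10, seed=0):
    """Return (mean RMSE, mean R²) over k shuffled folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint index sets
    rmses, r2s = [], []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        # Fit on the other k - 1 folds, score on the held-out fold.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        y_true = [ys[i] for i in fold]
        y_pred = [a + b * xs[i] for i in fold]
        rmses.append(rmse(y_true, y_pred))
        r2s.append(r2(y_true, y_pred))
    return sum(rmses) / k, sum(r2s) / k


# Synthetic data (assumption): linear signal plus unit-variance Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]
avg_rmse, avg_r2 = k_fold_scores(xs, ys)
```

Averaging per-fold scores, rather than pooling all held-out predictions into one score, gives each fold equal weight regardless of how much the target varies within it.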

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
