1. Introduction
The construction sector is essential for meeting growing needs and expanding social and economic activities while minimizing harm to the local environment [1]. The building industry accounts for a large share of global energy consumption and greenhouse gas emissions, driving up energy costs and causing severe environmental damage such as pollution and climate change [2]. For example, because of the country’s fast urbanization, the building process alone in China produced 4.997 billion tons of carbon dioxide (CO₂) in 2019, representing 50.6% of the nation’s total carbon emissions [3]. Furthermore, the construction sector in China consumed 2.233 billion tons of standard coal equivalent (SCE), representing 46.5% of the country’s total energy consumption [4]. In the European Union (EU), buildings account for 36% of greenhouse gas emissions and more than 40% of energy usage [5]. Globally, the building industry is trending toward increased energy usage, which is predicted to rise by 88% between 2003 and 2050 [2,6]. This issue has raised awareness of the need for near-energy-neutral green buildings in accomplishing sustainable development goals [7]. Environmental, social, and human viewpoints all highlight the advantages of adopting green construction over traditional building practices [8,9]. The benefits include high energy efficiency, lower CO₂ emissions, cost savings, and adaptable thermal comfort [10,11]. In recent years, the idea of “green building” has grown significantly. Evaluating and optimizing building energy performance in the preliminary stages of design is essential for developing a green building [12,13].
This early analysis aims to reduce the waste and discomfort that result from poor design so that the building can reach its full potential for both energy efficiency and occupant comfort [14]. Several energy simulation programs have been developed to gain insights into energy performance through dynamic modeling, including DOE-2, OpenStudio, Ecotect, DesignBuilder, and others [15]. However, because so many parameters are needed, this physics-based modeling approach can be overly complex and computationally inefficient [16]. A simpler approach is to adhere closely to internationally recognized green building certification systems that integrate sustainable design principles, such as China’s MHURD standards, the EU’s EPBD for near-zero-energy buildings, and LEED (Leadership in Energy and Environmental Design) [17]. However, comparing results against several criteria can be laborious, experience-dependent, and prone to subjective judgment [18,19]. The European Union established the Energy Performance of Buildings Directive (EPBD 2018/844/EU), among other regulations, to ensure that every new construction conforms to the requirements for near-zero-energy buildings (nZEBs) [20]. In China, the most widely used green building evaluation standard is issued by the Ministry of Housing and Urban–Rural Development (MHURD) [21]. Assessing a building against such standards, however, requires experience as well as a significant amount of time and work to evaluate a range of variables. Furthermore, the assessment may produce erroneous findings influenced by judgment and cognitive biases [22,23]. Because the building industry is becoming more information-intensive, it is beneficial to mine the growing corpus of data on building energy efficiency for hidden knowledge [24]. This is enabled by the rapidly growing big data sector. Notably, data-driven approaches have become more important in the assessment of green buildings because they facilitate automated, effective, and impartial decision-making [25]. Machine learning has emerged as a promising way to overcome the shortcomings of traditional building energy prediction methods during design [26,27]. Numerous algorithms, including multi-layer perceptrons, ensemble learning, support vector machines, and others, share several advantages: they are highly efficient, have a simplified parameter structure that is appropriate for the early stages of design, consistently perform well in predictions, and generalize well to complex energy systems [28,29]. Machine learning techniques also offer insight into the intricate relationship between the performance of green buildings and various influential factors, including personnel activities, façade openings, the envelope structure, and facility operational efficiency [30]. This information enables decision-makers to identify potential design issues early on and take appropriate action [31].
Even though many studies have produced highly effective machine-learning-based energy prediction models, the following three areas still require improvement [32]. First, fine-tuning hyperparameters is essential to controlling a machine learning model’s behavior [33]. Carefully choosing the hyperparameter configuration generally leads to better prediction quality [34]. Yet studies that have examined different machine learning techniques for predicting energy performance have paid little attention to automated tuning [35], and manual parameter tuning is still common despite being labor-intensive [36]. An automated approach enhances the model’s repeatability and reliability by rapidly determining the optimal hyperparameter combination in fewer iterations. Second, research to date suggests that machine learning is applied mostly during the operational phase rather than the design phase [37].
Since effective design accounts for about 30% of energy savings, it is desirable to incorporate machine learning from the design stage onward to support the decisions made by building designers [38]. A further challenge is that most machine learning approaches are complex and therefore difficult to interpret directly [39]. An inability to understand the predictive models and their results may lead to a lack of confidence. The solution is to use explainable machine learning algorithms that generate understandable explanations of variable importance and of the prediction mechanism [40]. Such explanations open up the black box, offering credible justifications and increasing trust in the models. The best practices for green building design are still under discussion [41]. Selecting a design can be framed as a multi-objective optimization (MOO) task, which offers an alternative to conventional human judgment [42]. MOO integrates with established machine learning methodology to produce Pareto-optimal compromise solutions without the need for complex equations [43]. This makes the prediction models more useful in practice and makes it easier to create data-driven, optimal plans for green buildings [44,45]. Considering sources of uncertainty is a necessary step towards making data-driven strategies more robust [46,47].
The goal of this study is to provide a framework that combines simulation based on building information modeling (BIM), explainable machine-learning-based prediction, and multi-objective optimization to give data-driven support for the design of successful green buildings from the earliest stage. The novel aspect of this study is the hybrid algorithm that uses computational intelligence methods to extract information about various aspects of building energy usage from massive volumes of BIM-based simulation data. The algorithm offers strong generalization, simultaneous optimization, in-depth explanation, and autonomous learning. The usefulness of this research lies in its potential to function as a trustworthy instrument for decision-making, enhancing computational efficiency and objectivity when pinpointing the most important variables and managing features of interest. By following the accurate forecasts and practical recommendations derived from the proposed data-driven analysis, green buildings can meet their objectives of minimizing environmental impact, enhancing indoor thermal comfort, and reducing energy consumption from the early design stage onward. The rest of the manuscript is organized as follows:
Section 2—Overview of relevant research,
Section 3—Methodology,
Section 4—Case study validating the proposed method’s performance,
Section 5—Reliability under uncertainty sources,
Section 6—Conclusions and future research recommendations.
3. Methodology
A hybrid framework combining explainable machine learning and multi-objective optimization is proposed for intelligent prediction and data-driven improvement of green building performance. The framework, outlined in Figure 1, comprises three key components that provide knowledge support for intelligent forecasting and for two optimization scenarios. First, orthogonal testing and BIM-based simulation are used to build a multi-feature dataset. Several features closely linked to green building energy efficiency are identified to develop a multi-feature assessment system. Second, a prediction model named Bayesian optimization–Light Gradient Boosting Machine (BO-LGBM) is constructed by combining ensemble learning with Bayesian optimization. To enhance model interpretability, LIME (Local Interpretable Model-agnostic Explanations) values quantify the significance of each input feature towards the target objective [92]. Third, the generated metamodel is passed to the multi-objective optimization (MOO) method AGE-MOEA (Adaptive Geometry Estimation-based Multi-Objective Evolutionary Algorithm) to determine the optimal solutions for constructing aesthetically pleasing and long-lasting buildings. Two scenarios are included in the data-driven optimization framework: the deterministic scenario and the uncertain scenario. The main difference between them is how uncertainty is handled. The deterministic scenario does not consider the combination of model and data uncertainty, while the uncertain scenario does. The uncertain scenario may thus improve the robustness and dependability of green building design choices by explicitly including these uncertainties in the optimization process.
A data-driven analytical framework for green building design first requires an assessment index system that includes objectives and contributing factors. Based on the established evaluation methodology, a dataset on building energy performance is then created using BIM modeling and DesignBuilder (2020) simulation. First, using Revit software (2020), a geometrically precise 3D BIM model of the proposed building is produced.
A Common Data Environment (CDE) is built as part of the BIM-based design information management process to enable data integration into the model [93]. Next, the simulation program DesignBuilder is employed for dynamic simulation, considering multiple parameters to provide accurate energy performance estimation [94]. DesignBuilder offers an intuitive graphical user interface for the EnergyPlus software (2020), with two key advantages: importing the BIM model in gbXML format eliminates the need to recreate an analysis model, and, when supplied with parameters, DesignBuilder enables highly realistic simulations that account for thermal mass, glazing, HVAC, and interactions across building systems and components. For efficiency and simplicity, the DesignBuilder-based dynamic simulations are planned using orthogonal testing. The core concept is to use an orthogonal array to streamline multi-factor studies by significantly reducing the number of experiments while ensuring uniform data distribution across the test range [95]. After data collection, data preparation (noise removal, standardization, and transformation) is a prerequisite for making the data useful for training machine learning models. The result is a higher-quality dataset to support the data-driven study of energy efficiency in green buildings. The methodology is detailed in Algorithm 1 below:
Algorithm 1. Pseudocode for the research workflow.
Input: Revit model, target objectives, design parameters
Output: Optimal solutions, probability constraints
Algorithm 1 combines BIM, Bayesian optimization with LGBM modeling, and MOO to derive optimal and reliable solutions from a Revit model. First, a Revit-based 3D model is created and simulated to generate results and parametric features. These features help define the objectives and prepare the datasets for model training. The next phase employs Bayesian optimization to fine-tune the hyperparameters of a predictive LGBM model, which is then evaluated for performance. To enhance model interpretability, LIME is applied, followed by Monte Carlo simulations to assess prediction robustness. The final phase formulates a multi-objective optimization problem using the defined objectives and design parameters. The AGE-MOEA algorithm is executed to determine the Pareto front, from which optimal solutions are selected. These solutions undergo further evaluation to establish probability constraints, ensuring their reliability. The algorithm thus systematically combines simulation, machine learning, and optimization techniques to achieve optimal design solutions while considering performance variability and reliability.
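The orthogonal testing step that plans the simulation runs can be illustrated with a small sketch. It builds the standard L9(3^4) orthogonal array (four factors at three levels in nine runs) and checks the balance property that gives uniform coverage. The factor names and level values are hypothetical placeholders, not the parameters used in this study, and the array size is chosen only for illustration.

```python
from itertools import combinations, product


def l9_orthogonal_array():
    """Standard L9(3^4) orthogonal array: 9 runs, 4 factors, 3 levels (0, 1, 2)."""
    return [(a, b, (a + b) % 3, (a + 2 * b) % 3)
            for a, b in product(range(3), repeat=2)]


def is_orthogonal(array, levels=3):
    """Check that every pair of columns contains each level pair equally often."""
    expected = len(array) // (levels * levels)
    for i, j in combinations(range(len(array[0])), 2):
        pairs = [(row[i], row[j]) for row in array]
        if any(pairs.count(p) != expected
               for p in product(range(levels), repeat=2)):
            return False
    return True


# Hypothetical factors and levels, chosen only for illustration.
FACTORS = {
    "window_to_wall_ratio": [0.2, 0.4, 0.6],
    "wall_u_value_w_m2k": [0.3, 0.5, 0.7],
    "cooling_setpoint_c": [24, 26, 28],
    "lighting_density_w_m2": [6, 9, 12],
}


def build_design(factors):
    """Map each row of level indices to concrete factor values (one run per row)."""
    names = list(factors)
    return [{name: factors[name][level] for name, level in zip(names, row)}
            for row in l9_orthogonal_array()]


design = build_design(FACTORS)
full_factorial_runs = 3 ** len(FACTORS)

if __name__ == "__main__":
    print(f"{len(design)} orthogonal runs vs {full_factorial_runs} full-factorial runs")
    for run in design:
        print(run)
```

Each row of `design` would correspond to one simulation case. A real study with more factors or levels would need a larger array, but the balance check stays the same.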
3.1. Predicting Building Energy Performance Using Ensemble Learning with Hyperparameter Optimization
Machine learning is a branch of artificial intelligence focused on learning from and adapting to large volumes of data. For predictive analytics, it makes it possible to model the nonlinear relationships between important parameters and energy performance objectives accurately and automatically. Compared to a single model, ensemble learning provides superior prediction accuracy and robustness by combining the predictive outputs of many base learners into a strong learner. The widely used ensemble learning method known as gradient boosting decision tree (GBDT) offers superior interpretability, accuracy, and efficiency. With GBDT, decision trees are built additively rather than independently as in typical random forests. Each tree is trained by fitting the residual errors from the prior iteration, resulting in faster and more precise predictions. Introduced in 2017, LightGBM (LGBM) is an effective GBDT implementation designed to handle large-scale data with high feature dimensionality efficiently [96]. As a distributed and highly efficient gradient boosting framework built on tree-based learning algorithms, LGBM offers faster training, lower memory use, higher accuracy, and better scalability [97]. Motivated by these advantages, LGBM is used as the metamodel in this study to forecast building energy performance. The mean absolute percentage error (MAPE) and the coefficient of determination ($R^2$), given in Equations (1) and (2), are used to quantitatively assess the performance of the LGBM-based prediction:

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i-\hat{y}_i}{y_i}\right| \times 100\% \tag{1}$$

$$R^2 = 1-\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2} \tag{2}$$

where $y_i$ is the measured value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of the measured values, and $n$ is the number of samples.
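As a minimal sketch of the two metrics, the functions below compute MAPE as the mean of |measured − predicted| / |measured| expressed as a percentage, and R² as one minus the ratio of the residual sum of squares to the total sum of squares. The sample values are made up for illustration and are not results from this study.

```python
def mape(measured, predicted):
    """Mean absolute percentage error, in percent (measured values must be nonzero)."""
    n = len(measured)
    return 100.0 / n * sum(abs((y - y_hat) / y)
                           for y, y_hat in zip(measured, predicted))


def r2(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(measured) / len(measured)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(measured, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in measured)
    return 1.0 - ss_res / ss_tot


# Made-up energy values (e.g., kWh/m2), for illustration only.
measured = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 300.0, 420.0]

if __name__ == "__main__":
    print(f"MAPE = {mape(measured, predicted):.2f}%")
    print(f"R2   = {r2(measured, predicted):.3f}")
```

For these values, the absolute percentage errors are 10%, 5%, 0%, and 5%, so MAPE is 5%; the residual and total sums of squares are 600 and 50,000, so R² is 0.988.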
Gradient-based One-side Sampling (GOSS) and Exclusive Feature Bundling (EFB) are two novel concepts that LGBM integrates. GOSS is a sampling technique that down-samples instances based on their gradients: it keeps the instances with larger gradients and randomly discards instances with smaller gradients, yielding a reasonably accurate estimate of information gain from a significantly reduced data size. EFB reduces feature dimensionality by identifying mutually exclusive features and bundling them into fewer dense features, which avoids unnecessary computation on zero feature values and is almost lossless. As a result, LGBM can achieve good prediction performance with a much faster and simpler training process. The mean absolute percentage error (MAPE) and coefficient of determination ($R^2$) are used to objectively assess the performance of the LGBM-based prediction [98]. $R^2$ measures the goodness of fit, while MAPE, the average absolute percentage error, provides scale independence and interpretability. Higher $R^2$ values nearer one and lower MAPE values nearer zero denote better prediction performance. In Equation (3), $f(x)$ is the objective function.
Furthermore, finding a better hyperparameter configuration helps create a machine learning model that performs better in predictions. In this sense, optimizing the model design depends heavily on hyperparameter tuning. Conventional manual parameter searches can be time-consuming and tedious. An automated hyperparameter optimization (HPO) procedure is required to solve this problem and improve the machine learning model’s reproducibility and usefulness while requiring less human intervention [99]. Notably, Bayesian optimization (BO) has become a powerful hyperparameter tuning method that enables efficient global optimization of costly black-box functions [100]. Returning to the GOSS component of LGBM, the estimated variance gain $\tilde{V}_j(d)$ of splitting feature $j$ at point $d$ is obtained from Equation (4):

$$\tilde{V}_j(d) = \frac{1}{n}\left[\frac{\left(\sum_{x_i\in A_l} g_i + \frac{1-a}{b}\sum_{x_i\in B_l} g_i\right)^2}{n_l^j(d)} + \frac{\left(\sum_{x_i\in A_r} g_i + \frac{1-a}{b}\sum_{x_i\in B_r} g_i\right)^2}{n_r^j(d)}\right] \tag{4}$$

where $g_i$ is the gradient of instance $x_i$, $A$ is the subset of instances with the largest gradients (the top $a \times 100\%$), $B$ is the subset randomly sampled from the remaining instances with lower gradients, the subscripts $l$ and $r$ denote the instances falling to the left and right of split point $d$, and $n_l^j(d)$ and $n_r^j(d)$ are the corresponding instance counts. The sum of gradients over $B$ is normalized using the coefficient $(1-a)/b$. Unlike random search and grid search, Bayesian optimization builds a probabilistic surrogate model of the objective function, using a Gaussian process to quantify the surrogate’s uncertainty. Its distinctive features are that it retains the history of past evaluations and uses it to determine the ideal set of hyperparameters in fewer iterations over the configuration space.
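The GOSS sampling step can be sketched in a few lines. The sketch assumes the common formulation in which the top `a` fraction of instances by absolute gradient is kept, a further `b` fraction of the dataset is drawn at random from the remainder, and the sampled low-gradient instances are re-weighted by (1 − a)/b. The gradient values and the defaults `a=0.2`, `b=0.1` are illustrative assumptions, not settings from this study.

```python
import random


def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Return (indices, weights) selected by gradient-based one-side sampling."""
    n = len(gradients)
    top_n = round(a * n)    # instances kept because their gradients are large
    rand_n = round(b * n)   # instances sampled from the low-gradient remainder

    # Sort instance indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    large = order[:top_n]
    small = random.Random(seed).sample(order[top_n:], rand_n)

    # Re-weight the sampled low-gradient instances so that they stand in
    # for the whole low-gradient remainder.
    weight = (1.0 - a) / b
    indices = large + small
    weights = [1.0] * top_n + [weight] * rand_n
    return indices, weights


# Illustrative gradients for 100 instances (deterministic, made up).
gradients = [((37 * i) % 100 - 50) / 10 for i in range(100)]
indices, weights = goss_sample(gradients)

if __name__ == "__main__":
    print(f"kept {len(indices)} of {len(gradients)} instances")
    print(f"weight on sampled low-gradient instances: {weights[-1]:.1f}")
```

With 100 instances, 20 high-gradient instances are kept and 10 low-gradient instances are sampled with a weight of 8, so the weighted sample still represents the 80 low-gradient instances while training on only 30.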
3.2. Multi-Objective Optimization and Explainable Machine Learning for Green Building Design
Despite its strong predictive capability, the LGBM metamodel does not by itself explain the nonlinear relationship it learns between inputs and outputs. To obtain an explainable machine learning solution, a method known as LIME (Local Interpretable Model-agnostic Explanations), which was introduced in 2017, is used to measure each feature’s contribution to the LGBM-based prediction [101,102]. Because LIME provides insight into how the LGBM model operates, managers can have more confidence in the forecast findings. LIME outperforms traditional feature importance approaches by providing attribution values, grounded in game theory, that are locally accurate, consistent, and unique [103,104]. The LIME value of a feature is its marginal contribution averaged over all possible feature combinations, and it can be computed as follows [105]:

$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(M-|S|-1)!}{M!}\left[f_x(S \cup \{i\}) - f_x(S)\right] \tag{5}$$

where $N$ is the set of all $M$ input features, $S$ is a subset of input features that excludes the $i$th feature, $f_x(S \cup \{i\})$ is the model output with the $i$th feature, and $f_x(S) = E[f(x) \mid x_S]$ is the model output without the $i$th feature (the expected value of the function conditioned on $S$). However, computing $E[f(x) \mid x_S]$ is inefficient, and the cost of the exact LIME value grows exponentially with the number of features. Therefore, a faster approximation called tree LIME was created for tree-based machine learning models such as LGBM, making it easier to understand how each feature influences the outcome. Tree LIME reduces the computational complexity from $O(TL2^M)$ to $O(TLD^2)$, where $T$ is the number of trees, $L$ is the maximum number of leaves a tree may have, $D$ is the maximum depth of a tree, and $M$ is the number of features. Integrating LIME into LGBM moves traditional machine learning models towards greater transparency, which increases the model’s usefulness and the trust placed in decisions based on it.
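The marginal-contribution definition above can be made concrete with a brute-force sketch that enumerates every feature subset, which is exactly the exponential cost the text mentions. It weights each subset by |S|!(M − |S| − 1)!/M!, the Shapley-value weighting from cooperative game theory. As a simplifying assumption, the conditional expectation E[f(x) | x_S] is estimated by fixing the features in S to the explained instance and averaging the model over a small background dataset. The linear `toy_model`, the background rows, and the instance `x` are hypothetical stand-ins for the LGBM metamodel and its data.

```python
from itertools import combinations
from math import factorial


def subset_value(model, x, background, subset):
    """Estimate f_x(S): fix features in `subset` to x, average the rest over background."""
    total = 0.0
    for row in background:
        z = [x[j] if j in subset else row[j] for j in range(len(x))]
        total += model(z)
    return total / len(background)


def feature_contributions(model, x, background):
    """Exact marginal contribution of each feature, enumerating all subsets (O(2^M))."""
    m = len(x)
    phi = []
    for i in range(m):
        others = [j for j in range(m) if j != i]
        phi_i = 0.0
        for size in range(m):
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                s = set(subset)
                with_i = subset_value(model, x, background, s | {i})
                without_i = subset_value(model, x, background, s)
                phi_i += weight * (with_i - without_i)
        phi.append(phi_i)
    return phi


def toy_model(z):
    """Hypothetical linear stand-in for the trained metamodel."""
    return 2.0 * z[0] + 3.0 * z[1] - 1.0 * z[2]


background = [[0.0, 0.0, 0.0], [1.0, 2.0, 4.0], [2.0, 4.0, 2.0]]
x = [3.0, 1.0, 2.0]
phi = feature_contributions(toy_model, x, background)

if __name__ == "__main__":
    for index, value in enumerate(phi):
        print(f"feature {index}: contribution {value:+.3f}")
```

For a linear model, each contribution reduces to the coefficient times the feature's deviation from its background mean, here 4, −3, and 0. The contributions sum to the difference between the prediction for `x` and the average background prediction, which is the local accuracy property referred to in the text.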
The problem of finding the most effective data-driven design strategies for green buildings can be formulated as a multi-objective optimization (MOO) problem. Applying the MOO process to the established BO-LGBM metamodel allows three goals connected to green buildings, namely energy consumption, carbon emissions, and interior thermal comfort, to be optimized simultaneously. The MOO problem and its optimization constraints may be stated mathematically as follows:
where $X$ is a feature vector made up of twelve variables $x$ from a feasible space $D$, and $F(X)$ stands for the prediction function from the BO-LGBM model. A MOO problem has no single optimal answer. Instead, it is possible to obtain the entire set of Pareto-optimal solutions $x' = (x'_1, \ldots, x'_k)$, which satisfy:
As shown in Figure 2, these Pareto front-based solutions are non-dominated, which means that no other solution $x = (x_1, \ldots, x_k)$ in the remaining search space is at least as good on every objective and strictly better on at least one.
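The non-dominance condition can be illustrated with a short filter. The sketch assumes that all three objectives are expressed so that smaller is better (for example, thermal comfort recast as a discomfort measure). The candidate names and objective values are made up for illustration and are not results from this study.

```python
def dominates(u, v):
    """True if u is at least as good as v on every objective and strictly better on one."""
    return (all(a <= b for a, b in zip(u, v))
            and any(a < b for a, b in zip(u, v)))


def pareto_front(candidates):
    """Keep the candidates whose objective vectors no other candidate dominates."""
    return {name: objectives
            for name, objectives in candidates.items()
            if not any(dominates(other, objectives)
                       for other in candidates.values())}


# Hypothetical designs: (energy use, carbon emissions, discomfort), all minimized.
candidates = {
    "A": (100.0, 50.0, 10.0),
    "B": (90.0, 55.0, 12.0),
    "C": (110.0, 60.0, 15.0),  # worse than A on every objective
    "D": (95.0, 52.0, 9.0),
    "E": (100.0, 50.0, 11.0),  # ties A on two objectives, worse on the third
}

front = pareto_front(candidates)

if __name__ == "__main__":
    for name, objectives in front.items():
        print(name, objectives)
```

Designs A, B, and D remain because each is better than the others on at least one objective, while C and E are removed because A is at least as good on every objective. An evolutionary algorithm such as AGE-MOEA searches for this kind of front over the full design space rather than over a fixed candidate list.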