**1. Introduction**

Buildings account for 40% of global energy consumption [1]. Of this figure, over 60% of the energy is consumed in the form of electricity, and only 23% of it is supplied by renewable sources [2]. Studies of nearly zero-energy buildings (NZEBs) and positive energy districts (PEDs) have recently drawn much attention to possible ways of reducing this energy demand [3,4]. NZEBs have a very high energy performance level, with the nearly zero or very low amount of energy they require covered to a significant extent by renewable sources, and their energy performance is determined on the basis of calculated or actual annual energy usage [5–7]. PEDs are defined as energy-efficient and energy-flexible urban areas with a surplus of renewable energy production and net zero greenhouse gas emissions. For both NZEBs and PEDs, building energy performance is a crucial criterion for indicating their energy achievements [8].

Building energy modeling is an efficient way to predict the different possible performance levels of a building [9]. Among modeling methods, data-driven approaches have shown their advantages in building energy modeling, especially at an urban scale [10–13]. Basically, a data-driven approach is a systematic framework of data modeling techniques, comprising model selection, parameter estimation and model validation, that supports analytical decision-making. Most machine learning methods are data-driven, since the machines or models are trained by learning a mapping function between the input and output of the data. The more experience a machine gains from learning, the better its performance will be. Thus, acquiring sufficient data is the basis for identifying accurate energy use patterns and deciding on the optimal actions to take in response. However, acquiring sufficient high-quality data for buildings at the urban level is a real challenge. Either random missing values or a large amount of incomplete information jeopardizes a model's validity. As data in urban building energy modeling (UBEM) are collected from different sources, it can take significant effort to integrate datasets into a standardized format for interoperability [14].

**Citation:** Han, M.; Wang, Z.; Zhang, X. An Approach to Data Acquisition for Urban Building Energy Modeling Using a Gaussian Mixture Model and Expectation-Maximization Algorithm. *Buildings* **2021**, *11*, 30. https://doi.org/10.3390/buildings11010030

Received: 29 November 2020; Accepted: 14 January 2021; Published: 16 January 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

By identifying such a research gap around the acquisition of data for urban building energy modeling, this paper aims to develop a novel approach. The specific contributions of this work are as follows: (1) it proposes to use a Gaussian mixture model (GMM), trained by the Expectation-Maximization (EM) algorithm, as a generative model to discover the populations from which the data can be drawn; (2) it uses real datasets to validate parameter estimation and generative performance; (3) it suggests that the bivariate Gaussian model is more robust than the univariate model; and (4) it discusses practical ways in which the initial values of the EM algorithm can be set.

The rest of the paper is structured as follows: based on an extensive literature review, the necessities and challenges of modeling with sufficient data are discussed in Section 2. Section 3 continues with a brief summary of the different ways to acquire more data. The philosophies of GMM and EM are presented in Section 4. In Section 5, the real datasets, parameter estimation details, and model evaluations are introduced. Section 6 discusses the spatio-temporal mapping of the synthetic data. It is followed by a conclusion in the final section.

#### **2. Necessities and Challenges of Building Performance Modeling with Big Data**

The analysis of building energy performance is shifting from single buildings to the district and urban levels. This shift opens new research domains associated with building energy performance, such as transportation, spatial correlations, grid operations and energy market actions. Together with factors affecting single-building modeling, such as occupant behavior and climate change, a large amount of data is being produced in different domains, and the causal relationships between them are becoming complex. For instance, Zhang et al. reviewed a modeling technique for urban energy systems at the building cluster level that incorporated renewable-energy-source envelope solutions, highlighting that high-resolution energy profiles, spatio-temporal energy demand (both building and mobility) and detailed engineering/physical and statistical models are desirable for further development [15]. Salim et al. defined occupant-centric urban data and the pipeline to process it in a review paper that also outlined the different sources of urban data for modeling urban-scale occupant behavior: mobility and energy in buildings [16]. Perera et al. developed a stochastic optimization method to consider the impact of climate change and extreme climate events on urban energy systems across 30 cities in Sweden under 13 climate change scenarios [17]. Yang et al. performed data-driven urban energy benchmarking of buildings using recursive partitioning and stochastic frontier analysis on a dataset of over 10,000 buildings in New York City [18]. By examining 3640 residential buildings, Ma and Cheng estimated building energy use intensity at an urban scale by integrating geographic information system (GIS) and big data technology [19]. Pasichnyi et al. proposed a data-driven building archetype for urban building energy modeling that uses rich datasets [10]. Risch et al. presented a level scheme to study the influence of data acquisition on the individual Bayesian calibration of archetypes for UBEMs [20]. Ali et al. developed a data-driven approach to optimize urban-scale energy retrofit decisions for residential buildings while acknowledging that developing a building stock database is a time-intensive process requiring extensive data (both geometric and non-geometric), which can, in its uncollected form, be sparse, inconsistent, diverse and heterogeneous in nature [11].

When training a model, inadequacies in the data acquisition process generally mean that important information will be lacking and that underlying features will be poorly captured. For example, energy policies may be misleading, and the data derived from them can be unclear or ambiguous. Securing sufficient data from different domains, and at a finer resolution, at the urban level will improve the quality of a model's decision-making. Understanding the impact of insufficient data on building energy modeling allows challenges to be identified, limits on the model's capacity to be overcome and ways of improving the data situation to be found [21]. Modern machine learning techniques, especially deep learning (DL) methods, are a powerful way to model large amounts of urban energy data. The reason that DL outperforms other methods is that it uses millions of parameters to create sophisticated, nonlinear relationships that map the input data to the output data. The central goal is to train the bulk of the parameters so that they not only fit the training set well but also work on a dataset that the model has not seen. The ability to perform well on previously unobserved inputs is called generalization [22]. Small datasets do not provide enough samples to update the parameters properly: the model can be trained to map the output with extremely high accuracy, but only on the observed dataset. This leads to the problem of overfitting. Feeding sufficient data to the model enables it to discover more comprehensive mapping rules and therefore enhances the model's generalization ability.
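As a toy illustration of this overfitting problem (not from the paper; the synthetic data and model choices are ours), a model flexible enough to memorize a small noisy training set attains zero training error while failing to generalize to fresh draws from the same process:

```python
import random

random.seed(0)

def noisy_sample(n):
    """Draw n points from the true relationship y = 2x plus Gaussian noise."""
    return [(i / n, 2 * (i / n) + random.gauss(0, 0.5)) for i in range(1, n + 1)]

def interpolate(train, x):
    """Lagrange polynomial through every training point: pure memorization."""
    total = 0.0
    for i, (xi, yi) in enumerate(train):
        weight = 1.0
        for j, (xj, _) in enumerate(train):
            if i != j:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total

def mse(points, predict):
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)

train = noisy_sample(8)
test = noisy_sample(8)   # fresh noise, same underlying process

train_error = mse(train, lambda x: interpolate(train, x))
test_error = mse(test, lambda x: interpolate(train, x))
# train_error is exactly 0 (perfect memorization); test_error is far larger,
# because the memorized noise does not generalize.
```

The memorizing model "wins" on the training set for exactly the reason described above: with as many effective parameters as observations, it fits the noise rather than the mapping rule.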

Data capture and storage, data transformation, data curation, data analysis and data visualization are all challenges when working with big data [23]. Establishing new systems that can gather the data required to estimate building energy performance requires multidisciplinary effort and comes with high financial and time costs. Consequently, missing data at different levels of building energy modeling is a common problem. Most existing statistical and nonlinear machine learning methods can provide reliable interpolations when the missing rate is small, for example lower than 30%, and the values are considered to be missing at random. When the missing rate is as large as 50–90%, or when sample information is completely missing from a large-scale dataset, it is unknown how well these methods will perform [24]. Alternatively, when a simplified building model handles incomplete data, its findings are usually fairly robust and efficient [25]. However, the assessment methodology depends on the situation in each specific area, and it can be difficult to generalize [26].

#### **3. Methods for Acquiring Building Energy Performance Data**

Traditional methods for acquiring building energy performance data include direct collection from meters, data augmentation and simulation. Combining these sources will enrich a dataset. Despite this, statistical methods, especially mixture models, have not been seriously examined from a generative-model point of view.

#### *3.1. Collecting Energy Performance Data*

Direct collection of building energy data from energy meters and sub-meters can be done in three different ways: energy readings, high-frequency energy logging and building management systems (BMS) [27]. Reading energy consumption data is easy and cheap if meters are on-site, but readers can make mistakes, and these are not easily discovered. One alternative is to apply computer vision techniques to read the meter automatically [28]. High-frequency energy data logging is the process of both collecting and storing data over relatively short time intervals. The core device is a data logger comprising an electronic sensor, a processor and storage memory. The development of cloud and edge computing makes the management of these data much smoother. Such data are usually used for accurate forecasting at high time resolution and serve as the basis for real-time bidding systems. A BMS is a digital system that monitors, controls and records a building's operations. However, its automated processes, analysis capabilities and integration with other systems are still not well developed, and the data that BMS systems provide have been shown to contain errors [27]. These challenges can be tackled with the aid of the modeling approaches and integration capabilities provided by building information modeling (BIM) [29].

Direct data collection from meters is fast and precise. Where there is uncertainty about energy performance, meter data can act as a system benchmark. With the help of computers, the handling of data is becoming increasingly efficient. However, errors, including missing and incorrect values, can still be made by humans and machines. Thus, investing in data infrastructures such as sensors, meters, sub-meters and data archiving systems, and linking these to data-driven building operation applications, is still essential [30].

#### *3.2. Data Augmentation*

Data augmentation is a particular technique in computer vision that allows images to be flipped, rotated, scaled and translated in order to produce more samples. New features can also be generated to aid model generalization. Although building energy data are not directly stored in the form of images, there have been a few studies that explore the different ways that data could be augmented.

#### 3.2.1. Energy Consumption

Unlike face recognition, intelligent transportation, precision medicine and other AI industries where computer vision has been comprehensively developed, efficient image-based approaches to analyzing building energy are still in their early stages [31]. Traditional data augmentation methods have seldom been considered, although a deep convolutional neural network is able to detect equipment usage and predict energy heat gains in office buildings [32]. In other research, building energy consumption data incorporating equipment use, lighting and air conditioning systems were augmented to capture more sequences by imposing a sliding sample window on the original training data [33]. With this technique, the number of training samples was enlarged to the length of the original sequence minus the window length.
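The sliding-window scheme can be sketched as follows (a minimal illustration, not the implementation used in [33]; the window length and toy series are invented):

```python
def sliding_windows(series, window):
    """Augment a 1-D load series into overlapping training samples.

    The window slides one step at a time, so the number of samples equals
    len(series) - window: each sample pairs a window of past values with
    the next value as the prediction target.
    """
    samples = []
    for start in range(len(series) - window):
        inputs = series[start:start + window]
        target = series[start + window]
        samples.append((inputs, target))
    return samples

hourly_load = [20, 22, 25, 30, 28, 24, 21, 19]   # toy consumption sequence
samples = sliding_windows(hourly_load, window=3)
# 8 - 3 = 5 overlapping (inputs, target) samples instead of a single sequence
```

An eight-point series with a window of three thus yields five training samples, matching the "length of the original sequence minus the window length" rule above.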

#### 3.2.2. Short-Term Load Forecasting

Short-term load forecasting (STLF) for buildings, which secures operations in a smart grid, focuses on predicting the energy load of a building a certain number of hours or days ahead. Deep learning has become the principal method for securing accurate predictions within this process from the demand side [34,35]. The inclusion of multi-source data such as user profiles, efficiency and weather data has been shown to improve the performance of load forecasting [36]. Without a sufficient supply of data, however, deep learning models with their huge numbers of parameters may fail to train well. Yet a large historical load dataset is very often unavailable. One reason is the development of distribution networks, in which the number of newly built distribution transformers, with no load history, is growing. Another is that the party responsible for data management frequently restricts access to it [37].

One recent study proposed a method that merged a three-phase dataset into a larger dataset so that the load forecasting of a particular phase could also draw on the other phases' information. The new dataset was then fed into a rolling scheme to generate training and testing sets [37]. Another study proposed that a method concatenating series to generate larger amounts of data is viable for a single building, whose load series should contain less uncertain and more homogeneous information [38]. The size of the data was enlarged (*K*² + *K*)/2 times by computing 1, 2, . . . , *K* centroid load profiles and the corresponding residuals. To improve on the concatenation method, instead of aggregating all of the historical data from previous years, a historical data augmentation method inserted one feature factor, which adopts adjacent loads as a new feature, into the original input [39].

#### 3.2.3. Learning-Based Methods

Generative Adversarial Networks (GANs) have been developed for many applications [40,41]. In a GAN, a generator generates random objects that are mixed with real objects for a discriminator to discriminate. The discriminator learns to assign high scores to real objects and low scores to generated objects. The generator is then updated so that it generates new objects that are likely to be assigned a high score by the fixed discriminator. The procedure stops when the discriminator is unable to discriminate between generated and real objects. In a recent work [42], a GAN was applied to one year of hourly whole-building electrical meter data from 156 office buildings so that the individual variations of each building were eliminated. The generated load profiles were close to the real ones, suggesting that GANs can further be used to anonymize data, generate load profiles and verify other generation models. A recurrent GAN preceded by core-feature pre-processing was also used to test the same dataset; the model trained with the synthetic data achieved a similar accuracy to the one trained with real data [43]. In another work, a conditional variational auto-encoder was developed to detect electricity theft in buildings. The method considered the shapes and distribution characteristics of samples at the same time, with the training set improving detection performance in comparison with other classifiers [44].

#### 3.2.4. Simulation

Building simulation is an economical method for evaluating building performance and comparing it with real data [45,46]. With a pre-defined physical environment and generally acceptable levels of accuracy, simulation tools can rapidly generate sufficient amounts of analyzable data. Stochastic factors that introduce uncertainty, such as occupant behavior, have been integrated into building performance simulations and have improved their accuracy [47]. Combined with Geographical Information System (GIS) programs, urban energy simulations are already demonstrating that they are likely to further reduce input data uncertainty and simplify city district modeling [48]. Some of the challenges of building simulation, such as model calibration, efficient technology adoption and integrated modeling and simulation, have also been addressed at various modeling scales, from the single building to the national level, and at different stages in the building life cycle, from initial design to retrofit [49]. For building simulation, the applicability of integrated tools, not only during early design but also throughout the building operation, management and retrofit stages, should be improved to support the most effective decisions [50].

#### **4. The Gaussian Mixture Model and Expectation-Maximization Method**

#### *4.1. Gaussian Mixture Model*

Building energy data are often a mixture of samples from different populations whose parameters differ from each other. A natural strategy is to identify these populations and generate new data points from the respective populations, for which a Gaussian mixture model (GMM) is built. A GMM is a simple linear superposition of Gaussian components, aimed at providing a richer class of density models than the single Gaussian [51]. A GMM is also a generative model, from which arbitrary amounts of data can be generated. While *K*-means, a special form of GMM, assigns an individual to the nearest center, GMM gives a soft allocation for all data points. GMM is an unsupervised method in which a data point is assumed to belong to a component with a certain probability. A categorical latent variable $Z$ is adopted to determine a specific component by letting $Z_k = 1$ and $Z_{-k} = 0$, where $Z_{-k}$ denotes the elements other than $Z_k$. The marginal distribution of $Z$ is denoted as $P(Z_k = 1) = \pi_k$ with $\sum \pi_k = 1$ for $k = 1, 2, \ldots, K$. Thus, in Equation (1), the marginal distribution for the observable variable $X$ will be

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k), \tag{1}$$

where $\mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k)$ is the Gaussian density. A posterior probability for $Z_k$ when $\mathbf{x}$ is given indicates how much each component contributes to the realization of $\mathbf{x}$:

$$\gamma(k \mid \mathbf{x}) = p(Z_k = 1 \mid \mathbf{x},\ \mu,\ \Sigma) = \frac{\pi_k\, \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x} \mid \mu_j, \Sigma_j)} \tag{2}$$

$\gamma(k \mid \mathbf{x})$ in Equation (2) is also known as the responsibility probability, allowing us to partition an observation into $K$ components. For a given set of independent observations $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$ with sample size $N$, we assume that $\mathbf{z}_n$, $n = 1, 2, \ldots, N$, is the latent variable for each observation. In addition to the location and shape parameters $\mu_k$ and $\Sigma_k$, the marginal probabilities $\pi_1, \pi_2, \ldots, \pi_K$ also contribute to the parameter space of the log-likelihood function in Equation (3):

$$\mathcal{L}(\pi,\ \mu,\ \Sigma) = \ln p(\mathbf{X} \mid \pi,\ \mu,\ \Sigma) = \sum_{n=1}^{N} \ln \left[ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k) \right] \tag{3}$$

The graphical representation for an observation $\mathbf{x}_n$ is illustrated in Figure 1. The explicit form of the derivatives of Equation (3) is not available due to the summation term inside the logarithm. The EM algorithm introduced in Sections 4.2 and 4.3 will address this difficulty and evaluate the likelihood in a tractable way.

**Figure 1.** Graphical representation of a set of data points.
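Equations (1)–(3) can be evaluated directly. Below is a minimal univariate sketch (scalar $x$, so $\Sigma_k$ reduces to a standard deviation; the example parameters are invented, loosely echoing the EUI means used later in Section 5):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, pis, mus, sigmas):
    # Equation (1): weighted sum of Gaussian component densities.
    return sum(p * normal_pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas))

def responsibility(x, k, pis, mus, sigmas):
    # Equation (2): posterior probability that x was generated by component k.
    return pis[k] * normal_pdf(x, mus[k], sigmas[k]) / mixture_pdf(x, pis, mus, sigmas)

def log_likelihood(xs, pis, mus, sigmas):
    # Equation (3): sum of log mixture densities over the sample.
    return sum(math.log(mixture_pdf(x, pis, mus, sigmas)) for x in xs)

pis, mus, sigmas = [0.5, 0.5], [66.7, 130.5], [15.0, 25.0]  # toy EUI-like components
r0 = responsibility(70.0, 0, pis, mus, sigmas)
r1 = responsibility(70.0, 1, pis, mus, sigmas)
# r0 + r1 == 1: the soft allocation over components always sums to one.
```

An observation at 70 is far more plausibly generated by the first component, so its responsibility `r0` dominates while `r0 + r1` remains exactly one.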

#### *4.2. The Expectation-Maximization (EM) Algorithm*

It has been proposed that an EM algorithm might be used to iteratively compute maximum likelihood with incomplete information, where many applications such as filling in missing data, grouping and censoring, variance components, hyperparameter estimation, reweighted least-squares and factor analysis are explicitly addressed [52]. EM is considered one of the top ten algorithms in data mining and simplifies the maximization procedures for GMM [53]. Density estimation for GMM via EM can also handle high-dimensional data [54]. Thus, for a parametrized probability model $P(\mathbf{X} \mid \theta)$, the joint distribution $P(\mathbf{X}, \mathbf{Z} \mid \theta)$ is introduced to rewrite the log-likelihood as

$$\mathcal{L}(\theta) = \ln P(\mathbf{X} \mid \theta) = \ln P(\mathbf{X}, \mathbf{Z} \mid \theta) - \ln P(\mathbf{Z} \mid \mathbf{X}, \theta), \tag{4}$$

where $\mathbf{X}$ is the observable variable and $\mathbf{Z}$ is the hidden variable. It should be noted that $\ln P(\mathbf{X} \mid \theta)$ is given in the form of a random variable. It will be equivalent to $\sum_{i=1}^{N} \ln P(\mathbf{x}_i \mid \theta)$ when the sample $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$ is obtained. We denote a density function of $\mathbf{Z}$ as $q(\mathbf{Z})$ with $q(\mathbf{Z}) > 0$. Since $q(\mathbf{Z})$ is irrelevant to $\theta$, taking the expectation of $\ln P(\mathbf{X} \mid \theta)$ with respect to the distribution of $\mathbf{Z}$ will not affect the value of $\mathcal{L}(\theta)$. On the other hand, taking the expectation of the right-hand side of Equation (4) results in

$$\mathcal{L}(\boldsymbol{\theta}) = \int\_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{\mathbf{P}(\mathbf{X}, \mathbf{Z} | \boldsymbol{\theta})}{q(\mathbf{Z})} d\mathbf{Z} - \int\_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{\mathbf{P}(\mathbf{Z} | \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})} d\mathbf{Z}.\tag{5}$$

In Equation (5), $-\int_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{P(\mathbf{Z} \mid \mathbf{X}, \theta)}{q(\mathbf{Z})}\, d\mathbf{Z}$, known as the Kullback–Leibler divergence and denoted as $KL(q \parallel P)$, is used to measure the distance between $q(\mathbf{Z})$ and $P(\mathbf{Z} \mid \mathbf{X}, \theta)$ [55]. $KL(q \parallel P)$ takes the value of 0 when $q(\mathbf{Z}) = P(\mathbf{Z} \mid \mathbf{X}, \theta)$; otherwise it is greater than 0. If we denote $\int_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{P(\mathbf{X}, \mathbf{Z} \mid \theta)}{q(\mathbf{Z})}\, d\mathbf{Z}$ as the evidence lower bound (ELBO), $\mathcal{L}(\theta)$ can be represented in Equation (6) as

$$\mathcal{L}(\theta) = ELBO + KL(q \parallel P). \tag{6}$$
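The non-negativity of the KL term is exactly what makes the ELBO a lower bound in Equation (6). A quick numerical check in the discrete case (the equations above use the continuous form; the probability values here are illustrative):

```python
import math

def kl_divergence(q, p):
    """Discrete Kullback-Leibler divergence KL(q || p)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.2, 0.5, 0.3]   # a candidate q(Z)
p = [0.4, 0.4, 0.2]   # the posterior it tries to match

gap = kl_divergence(q, p)    # strictly positive because q differs from p
zero = kl_divergence(q, q)   # exactly 0 when the two distributions coincide
```

Since the KL term is never negative and vanishes only when $q(\mathbf{Z})$ matches the posterior, $\mathcal{L}(\theta) \ge ELBO$ always holds, with equality precisely at $q(\mathbf{Z}) = P(\mathbf{Z} \mid \mathbf{X}, \theta)$.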

For fixed $\theta$ and $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$, $\mathcal{L}(\theta) \ge ELBO$, and $\mathcal{L}(\theta)$ takes the value of its lower bound ELBO only when $KL(q \parallel P) = 0$. Thus, the task becomes to find the estimate of $\theta$ such that

$$\begin{aligned} \hat{\theta}^{(t+1)} &= \underset{\theta}{\operatorname{argmax}}\; ELBO \\ &= \underset{\theta}{\operatorname{argmax}} \int_{\mathbf{Z}} P\big(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}\big)\Big[\ln P(\mathbf{X}, \mathbf{Z} \mid \theta) - \ln P\big(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}\big)\Big]\, d\mathbf{Z} \\ &= \underset{\theta}{\operatorname{argmax}}\; \mathbb{E}_{P(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)})}[\ln P(\mathbf{X}, \mathbf{Z} \mid \theta)]. \end{aligned} \tag{7}$$

$\theta^{(t)}$ is fixed for $P\big(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}\big)$, which is one of the specific options for $q(\mathbf{Z})$ in Equation (7). The superscript $(t)$ indicates the values obtained from the last iteration of the EM algorithm. Hence, $\int_{\mathbf{Z}} P\big(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}\big) \ln P\big(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}\big)\, d\mathbf{Z}$ is independent of $\theta$ and can be treated as a constant in the estimation. It should be noted that $P(\mathbf{Z} \mid \mathbf{X}, \theta)$ is the ideal representation of the form of $q(\mathbf{Z})$ in some specific models. For an implicit $q(\mathbf{Z})$ that is hard to obtain, an approximation method is usually applied to find the best $q(\mathbf{Z})$.

Two recursive steps for updating $\hat{\theta}^{(t+1)}$ in Equation (7) form the EM algorithm. In the E-step, we use the old $\theta^{(t)}$ to find the posterior distribution of the latent variable, which is used to calculate the expectation of the complete-data likelihood:

$$\mathcal{H}\left(\boldsymbol{\theta};\boldsymbol{\theta}^{(t)}\right) = \mathbb{E}\_{P(\mathbf{Z}|\mathbf{X},\boldsymbol{\theta}^{(t)})} [\ln P(\mathbf{X},\mathbf{Z}|\boldsymbol{\theta})].\tag{8}$$

In the M-step, $\theta^{(t+1)}$ is updated by maximizing Equation (8):

$$\boldsymbol{\hat{\theta}}^{(t+1)} = \arg\max\_{\boldsymbol{\Theta}} \mathcal{H}\left(\boldsymbol{\theta}; \boldsymbol{\theta}^{(t)}\right). \tag{9}$$

An iterative evaluation of Equations (8) and (9) guarantees the convergence of $\mathcal{L}(\theta)$ [56]. An illustration of this process can be seen in Figure 2. The M-step searches for a new parameter that increases the value of $\mathcal{H}(\theta;\,\cdot)$, which in turn increases $\mathcal{L}(\theta)$, because the property $\mathcal{L}(\theta) \ge ELBO$ always holds. Convergence is reached when $\theta$ no longer updates significantly or $\mathcal{L}(\theta)$ no longer improves.

**Figure 2.** Parameter update in the EM algorithm.

#### *4.3. Parameter Estimation for GMM*

$\mathcal{H}\big(\theta;\,\theta^{(t)}\big)$ is evaluated on the joint log-likelihood with complete data, which differs from the incomplete log-likelihood described in Equation (3). If we let the distribution of the latent variable $Z$ be $P(Z) = \prod_{k=1}^{K} \pi_k^{z_k}$ and the conditional distribution be $P(X \mid Z) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k)^{z_k}$, the likelihood for complete data is a product of the two distributions:

$$P(\mathbf{X}, \mathbf{Z} \mid \pi,\ \mu,\ \Sigma) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}\, \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)^{z_{nk}}, \tag{10}$$

where $z_{nk}$ indicates the $k^{th}$ component for $\mathbf{x}_n$. Compared with Equation (3), the benefit of evaluating Equation (10) is that the logarithm of $P(\mathbf{X}, \mathbf{Z} \mid \pi, \mu, \Sigma)$ does not act on any summation terms, leaving only a linear relationship between the latent variable $Z$ and the observed variable, namely

$$\ln P(\mathbf{X}, \mathbf{Z} \mid \pi,\ \mu,\ \Sigma) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk}\big[\ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)\big]. \tag{11}$$

Based on Equation (11), $\mathcal{H}\big(\theta;\,\theta^{(t)}\big)$ can be easily specified as

$$\begin{aligned} \mathcal{H}\big(\pi,\ \mu,\ \Sigma;\ \pi^{(t)},\ \mu^{(t)},\ \Sigma^{(t)}\big) &= \mathbb{E}_{P(\mathbf{Z} \mid \mathbf{X}, \pi^{(t)}, \mu^{(t)}, \Sigma^{(t)})}\big[\ln P(\mathbf{X}, \mathbf{Z} \mid \pi,\ \mu,\ \Sigma)\big] \\ &= \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk}\big[\ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)\big], \end{aligned} \tag{12}$$

where $\gamma_{nk} = \mathbb{E}[z_{nk}]$ is the responsibility probability for a given $\mathbf{x}_n$ to be partitioned into component $k$. Thus, the specification of $\gamma(k \mid \mathbf{x})$ in Equation (2), $\gamma_{nk}$, can be evaluated by

$$\gamma_{nk} = \gamma(k \mid \mathbf{x}_n) = \frac{\pi_k\, \mathcal{N}(\mathbf{x}_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x}_n \mid \mu_j, \Sigma_j)}. \tag{13}$$

The maximization of Equation (12) is in a tractable form when taking derivatives with respect to the parameters. Due to the constraint $\sum \pi_k = 1$, a Lagrange multiplier $\lambda$ is introduced for $\pi_k$. Finally, the E-step for GMM is to evaluate the log-likelihood given $\theta^{(t)} = \{\pi^{(t)},\ \mu^{(t)},\ \Sigma^{(t)}\}$, and the M-step updates the parameters:

$$\begin{aligned} \mu_k^{(t+1)} &= \frac{1}{\sum_{n=1}^{N} \gamma_{nk}} \sum_{n=1}^{N} \gamma_{nk}\, \mathbf{x}_n; \\ \Sigma_k^{(t+1)} &= \frac{1}{\sum_{n=1}^{N} \gamma_{nk}} \sum_{n=1}^{N} \gamma_{nk} \big(\mathbf{x}_n - \mu_k^{(t+1)}\big)\big(\mathbf{x}_n - \mu_k^{(t+1)}\big)^{T}; \\ \pi_k^{(t+1)} &= \frac{\sum_{n=1}^{N} \gamma_{nk}}{N}. \end{aligned}$$

When all the parameters are updated, we return to the E-step and evaluate Equation (13). As introduced in Section 4.2, the algorithm terminates either when $\theta$ stabilizes or when $\mathcal{L}(\theta)$ stops improving.
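The whole procedure (Equation (13) for the E-step, the closed-form updates above for the M-step) can be sketched in the univariate case as follows. This is an illustrative implementation, not the authors' code; the synthetic two-population data and the initialization of the means across the sample range are our own assumptions:

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_gmm(xs, k=2, iters=200, tol=1e-8):
    """Univariate K-component GMM via EM; Sigma_k reduces to a scalar sigma_k."""
    n = len(xs)
    spread = (max(xs) - min(xs)) or 1.0
    pis = [1.0 / k] * k
    mus = [min(xs) + spread * j / (k - 1) for j in range(k)]  # spread initial means
    sigmas = [spread / k] * k
    prev_ll = -math.inf
    for _ in range(iters):
        # E-step: responsibilities gamma_nk, Equation (13).
        gammas = []
        for x in xs:
            dens = [pis[j] * normal_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(dens)
            gammas.append([d / total for d in dens])
        # M-step: closed-form updates for mu_k, sigma_k and pi_k.
        for j in range(k):
            nk = sum(g[j] for g in gammas)
            mus[j] = sum(g[j] * x for g, x in zip(gammas, xs)) / nk
            var = sum(g[j] * (x - mus[j]) ** 2 for g, x in zip(gammas, xs)) / nk
            sigmas[j] = math.sqrt(max(var, 1e-8))
            pis[j] = nk / n
        # Convergence check on the incomplete log-likelihood, Equation (3).
        ll = sum(math.log(sum(pis[j] * normal_pdf(x, mus[j], sigmas[j])
                              for j in range(k))) for x in xs)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pis, mus, sigmas

# Synthetic two-population EUI-like sample (means loosely echo 66.68 and 130.46).
rng = random.Random(0)
data = [rng.gauss(66.7, 10.0) for _ in range(300)] + \
       [rng.gauss(130.5, 15.0) for _ in range(300)]
pis, mus, sigmas = em_gmm(data)
```

With well-separated populations, the recovered means land near the generating values and the mixing weights near 0.5 each; placing the initial means at the extremes of the sample range is one of the simple initialization choices discussed later in the paper.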

#### **5. Data and Performance**

#### *5.1. Test Datasets*

This paper considers two different datasets to validate the proposed method. Both datasets are well organized and free to use. The first one is the public building energy and water use data from Boston in the United States (the data are available at https://data.boston.gov/dataset/building-energy-reporting-and-disclosure-ordinance). The dataset was collected according to the Building Energy Reporting and Disclosure Ordinance (BERDO), which allows different types of building stakeholders to track their energy usage and greenhouse gas emissions and provides an assessment source for the local government to implement energy-saving policies. The original file from 2019 contained 28 variables in total, covering general variables such as building name, location and physical information, as well as energy-related variables. The variable *Energy Use Intensity* (EUI), total annual energy consumption divided by the gross floor area, is an effective measurement of energy performance. After eliminating the missing values and outliers from the Boston dataset, we identified 659 buildings labeled as multifamily housing.

The second dataset was collected in Aarhus (Århus in Danish), Denmark in 2020 in order to examine the district heating (DH) efficiency of different Danish building types (the data are available at https://data.mendeley.com/datasets/v8mwvy7p6r/1) [57]. From this dataset we extracted the EUI data for multifamily housing built between the 1960s and the 1980s and identified 183 buildings. The mean EUI values for Boston and Aarhus were 66.68 and 130.46, respectively. Since the two samples now comprised homogeneous information, we merged them into one and assumed the number of populations to be two. Thus, a univariate Gaussian distribution was considered.

In addition, our calculations took the variable age group from the Danish dataset, representing the period when the building was built, as a new population indicator to illustrate the bivariate case. The segmentation was determined by shifts in Danish building traditions and the tightening of energy requirements in the Danish Building Regulations. Two specific age groups were chosen to constitute the populations: '1951–1960' with 3461 buildings and 'After 2015' with 927. We then selected a secondary variable, *Ga*, to measure the daily heat load variation of these buildings, since we postulated that the load variation may form a representative feature for both populations. Ga is calculated as the accumulated positive difference between the hourly and daily average heat loads during a year, divided by both the annual average heat load and the number of hours (8760) in the year. Most of the values of Ga were around 0.2, indicating 20% load deviations on average. Thus, two populations (two age groups) and two variables (EUI and Ga) were constructed from the Danish dataset alone for the bivariate case.
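A hedged sketch of the Ga computation described above (our own illustration; the hypothetical `hourly_load` array stands in for one year of metered heat load):

```python
import numpy as np

def annual_ga(hourly_load):
    """Daily heat load variation: accumulated positive deviation of hourly
    loads from their daily means, normalized by the annual average heat
    load and the number of hours (8760) in the year."""
    h = np.asarray(hourly_load, dtype=float).reshape(365, 24)  # one year in daily blocks
    daily_mean = h.mean(axis=1, keepdims=True)
    positive_dev = np.clip(h - daily_mean, 0.0, None).sum()    # accumulated positive difference
    return positive_dev / (h.mean() * 8760)

# A perfectly flat load profile shows no daily variation at all
ga_flat = annual_ga(np.ones(8760))
```

A flat profile yields Ga = 0, while profiles whose hourly loads deviate noticeably from the daily mean push Ga toward the 0.2 level reported for the Danish buildings.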

#### *5.2. The Performance of EM Algorithm*

#### 5.2.1. The Univariate Case

The histograms are plotted to present the mixed and grouped distribution of EUIs for both cities. In the left panel of Figure 3, the mixed distribution created two peaks, although there was no obvious separation in the overlapped part. The true distributions can, however, be seen quite clearly in the grouped (separated) distribution displayed in the right panel of Figure 3. One limitation when using a Gaussian distribution is that the value of EUI cannot be negative. A truncated Gaussian may be a more appropriate representation. However, since all the truncated data points belonged to the Boston population, using GMM will not affect the responsibility probability *γnk* when conducting the E-step. Thus, we still follow the Gaussian assumption. If the Gaussian property is severely violated, for example, because of the heat load variation in the bivariate variables, a Box-Cox transformation is required before implementing EM.

*Buildings* **2021**, *11*, 30 10 of 20


**Figure 3.** Mixed and grouped distribution of EUI.

Another issue for the parameter estimation is to determine the initial values of *θ*. It is not a difficult task to compare several combinations for the univariate case, but it becomes a problem when the parameter size is large. Thus, we tested a number of possible combinations by varying the initial choices of *π*1, *π*2, *µ*1, *µ*2, *σ*1, *σ*2 and applied the same scheme to the bivariate case. More details on this process can be found in previous work on the setting of initial values [58]. The choice was determined by considering whether the final estimation could represent both populations well.

The empirical results showed that the choices of *π*1, *π*2, *σ*1, *σ*2 did not seem to be dominant for the convergence, and we simply took *π*1 = *π*2 = 0.5 and *σ*1, *σ*2 to be the sample standard deviations. On the other hand, close values for *µ*1 and *µ*2 failed to separate the populations. In most of the experiments, *µ*1 and *µ*2 converged to a single value. Thus, we initialized *µ*1 and *µ*2 by constraining them to the upper and lower 1/3 quantiles of the mixed population, respectively. Further, one hundred initial values for *µ*1 and one hundred for *µ*2 were randomly generated to obtain ten thousand combinations, from which we randomly chose ten for evaluating the performance.
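The initialization scheme can be sketched as follows (our illustration; the function name and the toy sample are hypothetical): *π* and *σ* are fixed, while candidate means are drawn from the lower and upper thirds of the mixed sample.

```python
import numpy as np

def draw_initial_means(x, n_each=100, seed=1):
    """Draw candidate initial means: mu1 constrained to the lower 1/3 quantile
    range of the mixed sample, mu2 to the upper 1/3 quantile range."""
    rng = np.random.default_rng(seed)
    lo, q13, q23, hi = np.quantile(x, [0.0, 1 / 3, 2 / 3, 1.0])
    mu1 = rng.uniform(lo, q13, size=n_each)
    mu2 = rng.uniform(q23, hi, size=n_each)
    return mu1, mu2

# Toy mixed sample; pi1 = pi2 = 0.5 and the sample standard deviations stay fixed
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(67, 31, 650), rng.normal(128, 39, 190)])
mu1, mu2 = draw_initial_means(x)
pairs = [(m1, m2) for m1 in mu1 for m2 in mu2]  # 100 x 100 = 10,000 combinations
```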

The ten sets of parameters are summarized in Table 1. For every parameter, there was no significant variation, and the mean values can be used to represent *µ*ˆ1 and *µ*ˆ2. We also computed the absolute errors in percentage terms between the mean and true values. Most of them were within 5%, while the overestimation of *σ*2 for Aarhus might be due to the slightly smaller estimation for *µ*2. Unlike fixed initial values, we also allowed for random variations up to 25% for *σ*1 and *σ*2 to validate our argument. As Table 2 shows, the result resembled Table 1. The same conclusion could be drawn for *π*1 and *π*2, which are not shown here. It is also observed that the performance of the log-likelihood values in Figure 4 is uniform. All of the experiments stopped within 15 updates and seemed to converge at the same point. In other words, it is enough to make inferences based on current estimations.



**Table 1.** Summary of the parameter estimation for fixed variance.


**Table 2.** Summary of the parameter estimation for random variance.


**Figure 4.** Log-likelihood performance for the univariate case.

We created two scenarios and assigned the sample points to the population with the larger density for classification. The two corresponding confusion matrices are presented in Table 3. We divided all the data points into four categories: true Boston, false Boston, true Aarhus and false Aarhus. The accuracy, the sum of both true classification rates, is close to 90%.
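The classification rule (assign each point to the component with the larger weighted density) and the resulting confusion proportions might be sketched as below; the helper names are ours, and the parameter values are the fitted univariate estimates appearing in this section:

```python
import numpy as np
from scipy.stats import norm

def classify(x, pi, mu, sigma):
    """Assign each observation to the component maximizing pi_k * N(x | mu_k, sigma_k)."""
    dens = np.stack([p * norm.pdf(x, m, s)
                     for p, m, s in zip(pi, mu, sigma)], axis=1)
    return dens.argmax(axis=1)  # 0 = Boston, 1 = Aarhus

def confusion(true_labels, pred_labels, n_classes=2):
    """Proportions of true/false classifications, laid out as in Table 3."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(true_labels, pred_labels):
        cm[p, t] += 1
    return cm / len(true_labels)

# Low EUI values fall to the Boston component, high values to Aarhus
x = np.array([40.0, 70.0, 130.0, 160.0])
pred = classify(x, pi=[0.77, 0.23], mu=[66.71, 128.33], sigma=[31.39, 39.38])
```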



**Table 3.** Classification accuracy for the univariate case.

| Population (Predicted) | True Boston (Fixed) | True Aarhus (Fixed) | True Boston (Random) | True Aarhus (Random) |
|---|---|---|---|---|
| Boston | 72.92% | 7.13% | 73.40% | 7.72% |
| Aarhus | 5.34% | 14.61% | 4.87% | 14.01% |

We then demonstrated the fitness between theoretical and empirical proportions. Proportion here refers to the quotient for which Boston's EUI should theoretically and empirically account. The theoretical proportion is given by

$$\mathcal{P}\_{th}(\text{Boston}) = \frac{0.77 \mathcal{N}(\tilde{\mathbf{x}}|66.71, \text{ 31.39})}{0.77 \mathcal{N}(\tilde{\mathbf{x}}|66.71, \text{ 31.39}) + 0.23 \mathcal{N}(\tilde{\mathbf{x}}|128.33, \text{ 39.38})},$$

where <sup>e</sup>*<sup>x</sup>* corresponds to the probability quantile segmentation on the x-axis in Figure 5 for the mixed distribution. Similarly, the empirical proportion, P*em*(*Boston*), counts the number of data points from Boston divided by all data points between two adjacent values of <sup>e</sup>*x*. For example, if 100% of the observations are taken from Boston, P*th*(*Boston*) should be extremely close to 1 at the quantile 0. Both P*th*(*Boston*) and P*em*(*Boston*) are supposed to decrease because *µ*ˆ2 is greater than *µ*ˆ1. Thus, the results showed a good fit except for the quantiles close to 1 in the bottom right corner. This is because of the long tail of Boston's EUI. However, there were only a total of eight data points in this area, which is a minor deviation compared with P*th*(*Boston*). The performance for Aarhus would look exactly the same just by vertically reversing Figure 5.
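P*th*(*Boston*) can be evaluated directly from the fitted mixture; a minimal sketch (our illustration):

```python
from scipy.stats import norm

def p_th_boston(x):
    """Theoretical proportion of Boston at EUI value x, using the fitted mixture:
    weights 0.77/0.23, Boston ~ N(66.71, 31.39), Aarhus ~ N(128.33, 39.38)."""
    b = 0.77 * norm.pdf(x, 66.71, 31.39)
    a = 0.23 * norm.pdf(x, 128.33, 39.38)
    return b / (b + a)

# The proportion is close to 1 at low EUI and decreases in x,
# since the Aarhus mean (128.33) exceeds the Boston mean (66.71)
low, high = p_th_boston(30.0), p_th_boston(180.0)
```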

**Figure 5.** Theoretical and empirical proportions.

Given the results, we have drawn an additional two-step random sampling from the distributions to examine the generated EUIs. In step 1, we created a vector to store a random sequence of 0s and 1s that indicate to which population a generated point belongs. The probabilities are taken as *P*(*I* = 0) = 0.77 and *P*(*I* = 1) = 0.23, where *I* is the indicator. The length of the vector is the same as that of the observed sample, namely 842. In step 2, Gaussian random samples were drawn for each population to constitute the generated data. The quantile–quantile plot is shown for both fixed and random initial *σ*s. As seen in Figure 6, the quantile values were taken every 5%. It is not surprising that the two settings for the initial *σ* did not differ a great deal, and almost all of the quantile values were located on the 45-degree line, which means that the quantile values matched each other. In this sense, the generated sample under GMM presented a reliable representation of the populations. Here we only used the true sample to validate a generated sample of the same size. When the method is used in an authentic context, the sample size will depend on the total number of buildings in the original cohort.
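The two-step sampling procedure can be sketched as follows (our illustration; the parameter values follow the fitted univariate mixture):

```python
import numpy as np

def generate_euis(n=842, p_boston=0.77,
                  params=((66.71, 31.39), (128.33, 39.38)), seed=2):
    """Step 1: draw 0/1 population indicators with P(I=0) = p_boston.
    Step 2: draw a Gaussian EUI from the indicated population."""
    rng = np.random.default_rng(seed)
    indicator = (rng.random(n) >= p_boston).astype(int)  # 0 = Boston, 1 = Aarhus
    (mu0, sd0), (mu1, sd1) = params
    sample = np.where(indicator == 1,
                      rng.normal(mu1, sd1, n),   # Aarhus draws
                      rng.normal(mu0, sd0, n))   # Boston draws
    return sample, indicator

# Same length as the observed mixed sample; quantiles of `sample` can then be
# compared with the observed quantiles in 5% steps for a Q-Q plot
sample, indicator = generate_euis()
```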

**Figure 6.** Quantile–quantile plot for observed and generated data.

#### 5.2.2. The Bivariate Case

Testing the bivariate case requires more parameters to be estimated. The selection of the initial values followed the same paradigms that were adopted for the univariate case. In order to keep the Gaussian property, as mentioned in Section 5.2.1, the daily heat load variation (Ga) was treated by a Box-Cox transformation [59]. Since the estimations then became complex, we also increased the number of experiments for determining the estimates. We picked 20 initial values from each of the population means: *µEUI*1, *µGa*1, *µEUI*2 and *µGa*2.

The number of combinations became 20<sup>4</sup> = 160,000. In all the experiments, we highlighted the combinations with a log-likelihood in the top 10%. We present the resulting pattern in Figure 7, where 400 combinations for the population '1951–1960' were located on the x-axis, while 400 for the population 'After 2015' were located on the y-axis. The combinations were arranged from {minimum EUI, minimum Ga} to {minimum EUI, maximum Ga} and then to {maximum EUI, maximum Ga}. The figure shows that there were slight periodic patterns among the bigger 20 × 20 grids. Higher log-likelihood was slightly denser in the top left part. In almost all of the bigger 20 × 20 grids, however, high log-likelihood could always be found. Thus, the selection of initial values for the EM algorithm in the bivariate case appears to be somewhat isotropic at finding estimates with high log-likelihood values.
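The Box-Cox step used to restore approximate Gaussianity before running EM can be sketched with SciPy (our illustration; the `ga` values below are synthetic stand-ins, not the Aarhus data):

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(3)
# Synthetic heat load variation: strictly positive and right-skewed, centered near 0.2
ga = rng.lognormal(mean=np.log(0.2), sigma=0.4, size=1000)

# boxcox requires strictly positive input; lambda is chosen by maximum likelihood
ga_transformed, lam = boxcox(ga)

# The right skew of the raw values should largely vanish after the transformation
raw_skew, transformed_skew = skew(ga), skew(ga_transformed)
```

Because the transformation is nonlinear, the mean of `ga_transformed` is not simply the transformed mean of `ga`, which is why estimates are reported on the transformed scale.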

Something similar happened when we summarized the results of the top 10% experiments in Table 4. In the bivariate case the overall errors decreased significantly compared with the univariate case. The majority were now below 3%. The reason for the bivariate model's success might be that it is able to use more of the energy performance features to separate the populations more correctly. It should be noted that the estimated *µ*ˆ*Ga* was not equal to the transformed mean value of Ga because the Box-Cox transformation is nonlinear. Thus, we only show the results for the transformed values here. We present both transformed and non-transformed Ga in the generative model evaluation later on. The classification accuracy is further computed in concordance with the univariate case in Table 5. Given the better parameter estimations, the true accuracy was over 99%.


**Figure 7.** Distribution of top 10% log-likelihood for the combination of the initial values.


**Table 4.** Summary of the parameter estimation in the bivariate case.

**Table 5.** Classification accuracy for the bivariate case.

| Age group (Predicted) | True 1951–1960 | True After 2015 |
|---|---|---|
| 1951–1960 | 78.56% | 0.20% |
| After 2015 | 0.32% | 20.92% |

The estimates obtained in Table 4 were used to generate density contours, with dense and sparse areas distinguished in the two-dimensional surface. We compared these with the true distributions because they disclose the real scales of the densities. The contours are displayed in the left panel of Figure 8, which shows the comparison for the transformed heat load variation, while the result for the non-transformed distributions is shown in the right panel. The generative models are supposed to characterize the distributions in both dense and sparse areas. Both panels had obvious and observable centers, and both of these centers converged at the densest part of the real data. In other words, the generative models represented the real data to an acceptable degree. As discussed in the univariate case, the actual number of generated samples depends on the city's capacity to evaluate its energy performance with insufficient data.

**Figure 8.** Density comparisons for the bivariate generative model.

#### **6. Discussion**

Our proposed method for generating synthetic data is based on a distributional overview of the energy performance across all of the buildings in a district or urban area. However, our method does not take any spatial or temporal information into account. Imposing spatio-temporal mapping onto the synthetic data will help to draw an even better picture of building energy performance at the urban scale. As shown in Figure 9, spatial mapping takes the geolocations of buildings into consideration. Practically, the size of unlabeled data is far larger than that of labeled data. The performance of supervised learning models varies with limited labeled data. By including physical features, synthetic data are used to select and validate the supervised learning models for each population in a much more robust way.

The GMM method discussed in this work handles temporally aggregated data. With temporal mapping, it is possible to create higher time resolution data (for example hourly data) by including additional variables such as building class and hourly weather data. A set of buildings with known hourly energy consumption and building class could be taken as a set of reference buildings. By weighting from the reference buildings and sorting out a convex optimization problem, the synthetic data could then generate a set of energy performance profiles [60].

**Figure 9.** Spatio-temporal mapping of urban buildings from GMM synthetic data.

#### **7. Conclusions**

Big data are costly to collect and use. Even when this process is automated, many data collection systems operate with incomplete data or, in the case of building energy performance, are only able to collect data on a limited scale. From a statistical point of view, data from different groups are mixed together with little attempt made to distinguish between their representative populations. This hinders the efficient modeling of energy performance data, particularly at a large scale, and the construction of synthetic data for target groups.

In this paper, we proposed a GMM with an EM algorithm that can model building energy performance data for both univariate and bivariate cases. The energy performance indicators of a sample of buildings from Boston and Aarhus were adopted to segment mixed populations. For the univariate case, the Energy Use Intensity data from the two different cities were analyzed, and the updates were shown to have quick convergence. The derived models were able to capture the distributional features and to reflect the true population. The classification rate was almost 90%, and the generated data matched in quantile to the observed data. For the bivariate case, we showed that the inclusion of the new variable daily heat load variation further increased the power in parameter estimation, thus making the classification rate higher than in the univariate case. These data not only generate reliable density representations, but they also can be adjusted according to the real building capacity of a city with a spatio-temporal mapping.

Moreover, there are a number of topics in connection with these findings that would be interesting to explore in future studies.


• Finally, it is also appealing that more probability distributions could be studied instead of using data transformation.

**Author Contributions:** Conceptualization, M.H., Z.W. and X.Z.; methodology, M.H. and Z.W.; software, M.H.; validation, M.H.; formal analysis, M.H. and Z.W.; data curation, M.H.; writing—original draft preparation, M.H. and X.Z.; writing—review and editing, M.H. and X.Z.; funding acquisition, M.H. and X.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the IMMA research network between Microdata Analysis and Energy and Built Environments in the School of Technology and Business Studies at Dalarna University; and the UBMEM project at the Swedish Energy Agency (Grant no. 46068).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**


#### **References**

