2.2.1. Simple Model Challenges and Suggested Approach
(1) Concept used in the simple model
With the penetration and integration of artificial intelligence (AI), the use of AI, machine learning, and data-driven methods for building environment analysis and optimization has become increasingly important [52,53]. Deep learning algorithms are based on representation learning in machine learning, which aims to find better representations of data and to build models that learn these representations from large amounts of unlabeled data. In simple terms, a deep learning neural network is a system that mimics the human brain and constructs a non-linear relationship between input and output.
The fundamental purpose of this paper is to propose a generalized simple model, based on a deep learning algorithm, that can accurately predict the overheating risk of a ZEB with a small number of input variables. This study further investigates the potential of combining unsupervised algorithms with supervised deep learning to predict indoor thermal comfort.
Initially, the output variables are defined as indoor temperature and CO2 density. Indoor temperature is the critical index used to evaluate the overheating risk of the building. In addition, indoor CO2 density, which represents occupancy, has a strong influence on indoor temperature and overheating risk and is therefore selected as an output. However, CO2 density is difficult to measure directly and generally varies with human activity, so it is essential to predict it with the proposed model. The CO2 density indicates the number of occupants and thus the amount of internal heat generated in the room. Internal heat generation directly affects the indoor temperature and is one of the factors that cause overheating in a ZEB. Furthermore, the forecast period covers 1 May to 31 October because of the climate characteristics of Korea. Building design standards and indoor thermal environment standards usually specify the hygrothermal parameters of buildings only in summer and winter and neglect the intermediate seasons (spring and autumn). However, there is a large temperature difference between day and night in the intermediate seasons, for example in May and October in Korea.
Figure 4 presents the basic process used to establish the prediction model. Before the prediction model is established, the raw data should be preprocessed. A box plot can be used to detect and remove outliers from the experimental raw data, thus avoiding interference caused by physical errors in the modeling. The preprocessed data are then randomly divided into a training dataset and a testing dataset, and only the training dataset is used in the modeling process. After that, the feature variables are selected through Pearson correlation analysis, and the set of feature variables for modeling is determined. The first step of modeling is to use unsupervised deep learning to add operational pattern identification tags as model inputs; the second step is to apply supervised deep learning to develop the prediction models. Finally, the output should be validated with the testing dataset.
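As a minimal sketch of this preprocessing step, assuming a tabular dataset with hypothetical column and file names (the paper does not give its data layout), the box-plot rule can be applied with the usual 1.5 × IQR fences before the random split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def remove_outliers_iqr(df: pd.DataFrame, cols, k: float = 1.5) -> pd.DataFrame:
    """Drop rows falling outside the box-plot fences [Q1 - k*IQR, Q3 + k*IQR]."""
    mask = pd.Series(True, index=df.index)
    for c in cols:
        q1, q3 = df[c].quantile(0.25), df[c].quantile(0.75)
        iqr = q3 - q1
        mask &= df[c].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# Hypothetical file and column names; the actual dataset layout is not given.
raw = pd.read_csv("zeb_measurements.csv")
clean = remove_outliers_iqr(raw, ["indoor_temp", "co2_density"])

# Random split; only the training set is used in the modeling process.
train, test = train_test_split(clean, test_size=0.2, random_state=42)
```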
(2) Steps of model estimation
(a) Input variables selection
The purpose of input selection is as follows: (1) to identify the most effective and most correlated variables in the entire dataset; (2) to discover variables with low redundancy and high correlation so as to save computational time; (3) to select easily obtained variables so as to improve the applicability and robustness of the model.
In statistics, there are three commonly used correlation coefficients: the Pearson, Spearman, and Kendall correlation coefficients. Among the three, the Pearson correlation coefficient is used in this study to measure the degree of linear correlation; Spearman and Kendall are rank correlation coefficients [54] that reflect the degree of rank correlation.
The Pearson correlation indexes [55] are shown in Table 4. Apparently, the FCU return water temperature and indoor relative humidity show the highest correlations with indoor temperature, followed by solar irradiation, outdoor temperature, pump power (an operation variable), and outdoor relative humidity. Since the operation variables of the cooling equipment can only be obtained by installing dedicated sensors, using these variables as input data would limit the broad applicability of the prediction model. Hence, indoor relative humidity, solar irradiation, outdoor temperature, and outdoor relative humidity are ultimately chosen as the input variables for indoor temperature prediction.
Similarly, as shown in Table 5, the correlations with CO2 density rank as follows: indoor temperature, return water temperature, indoor relative humidity, solar irradiation, outdoor temperature, pump power, and outdoor relative humidity. Therefore, indoor temperature is added as a further input variable for predicting CO2 density to achieve higher accuracy.
(b) Clustering algorithm selection
A clustering analysis is used in this study to detect faulty data and to perform an initial identification of the indoor environment modes. Cluster analysis maximizes the similarity between data samples in the same cluster and minimizes the similarity between data objects in different clusters in the final partition. The massive dataset is categorized to differentiate its patterns and to expose stronger regularities for the prediction model. The self-organizing map (SOM) neural network, also known as a Kohonen network, is an unsupervised competitive learning network proposed by Kohonen in 1981. As a nonlinear unsupervised clustering algorithm, it has been applied widely in artificial neural networks [56]. The algorithm gathers similar samples into the same category according to distance, thereby clustering the data. During learning, the competition among neurons is unsupervised: the network automatically finds regularities in the distribution characteristics and topology of the input vectors, adaptively adjusts the weights between its nodes, and finally completes the clustering of the input data. Therefore, the method has been used widely in clustering analysis, signal processing, data dimension reduction, and other fields [57].
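As one possible realization of this clustering step, the sketch below uses the open-source MiniSom package to train a small SOM on the selected inputs and to attach the winning-node index to each sample as an operation-mode tag; the grid size and iteration count are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from minisom import MiniSom  # pip install minisom

X = train[["indoor_rh", "solar_irradiation", "outdoor_temp", "outdoor_rh"]].to_numpy()
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize before distance-based clustering

# A 3x3 map, i.e., up to nine candidate operation modes (illustrative choice).
som = MiniSom(3, 3, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=42)
som.train_random(X, 5000)

# The winning node of each sample serves as its pattern-identification tag.
train["mode_tag"] = [i * 3 + j for i, j in (som.winner(x) for x in X)]
```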
(c) Prediction algorithm selection
Three machine learning methods are selected in this study to build the simple model: the backpropagation (BP) neural network, the radial basis function (RBF) network, and long short-term memory (LSTM). BP is a classical feed-forward neural network, RBF is a special neural network based on radial basis functions, and LSTM represents a feedback (recurrent) neural network.
The long short-term memory (LSTM) neural network [58] is a special type of recurrent neural network (RNN) that can learn long-term dependencies in time series, which aligns with our research because LSTM can process not only single data points but also entire sequences of data or historical states. It is suitable for chronologically ordered indoor temperature sequences, for multivariable, strongly coupled, and severely nonlinear relationships, and for situations where the statistical behavior is difficult to describe in functional form. The LSTM neuron structure is shown in Figure 5. There are three gate structures in the neuron: the input gate, the output gate, and the forget gate. The first step is to decide which information is discarded from the cell state through the forget gate; the second step is to determine which information is stored in the cell through the input gate; and the third step is to set the output value through the output gate.
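The three gates described above correspond to the standard LSTM update equations, reproduced here for reference (with $\sigma$ the logistic sigmoid, $\odot$ the element-wise product, $x_t$ the input, $h_t$ the hidden state, and $C_t$ the cell state):

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate)}\\
\tilde{C}_t &= \tanh\!\left(W_C [h_{t-1}, x_t] + b_C\right) && \text{(candidate cell state)}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell state update)}\\
o_t &= \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(C_t) && \text{(hidden state / output)}
\end{aligned}
```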
The BP network [59] has very good nonlinear fitting ability and can be used to identify complex, nonlinear systems. In particular, a BPNN can build a relatively good functional relationship between input and output signals by training the network on the original samples, so it is well suited to short-term prediction.
The RBF network plays an important role in the field of neural networks; for example, RBF neural networks possess the unique best-approximation property. As a kernel function, a radial basis function can map input samples into a high-dimensional feature space and thereby solve some problems that are linearly inseparable in the original space.
The prediction model proposed in this study is shown in Figure 6. The whole prediction process is divided into two layers. Outdoor temperature, solar irradiation, outdoor relative humidity, and indoor relative humidity are the input variables for the first prediction model, and the predicted indoor temperature is added as a new input to estimate CO2 density in the second loop. The accuracy of the second layer therefore depends on the first layer: the prediction accuracy of CO2 density is underpinned by accurate prediction of indoor temperature.
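A minimal sketch of this two-layer cascade using Keras is given below, assuming the data have already been windowed into sequences of shape (samples, timesteps, features) and stored in hypothetical arrays X_train_env, y_train_temp, and y_train_co2; the layer sizes, window length, and training settings are illustrative, not the paper's tuned hyperparameters:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def make_lstm(n_features: int, timesteps: int = 24) -> keras.Model:
    """One-output LSTM regressor shared by both cascade stages."""
    model = keras.Sequential([
        layers.Input(shape=(timesteps, n_features)),
        layers.LSTM(32),
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Stage 1: four environmental inputs -> indoor temperature.
temp_model = make_lstm(n_features=4)
temp_model.fit(X_train_env, y_train_temp, epochs=50, batch_size=64, verbose=0)

# Stage 2: the predicted indoor temperature joins the inputs -> CO2 density.
temp_pred = temp_model.predict(X_train_env)                     # (samples, 1)
temp_seq = np.repeat(temp_pred[:, None, :], X_train_env.shape[1], axis=1)
co2_model = make_lstm(n_features=5)
co2_model.fit(np.concatenate([X_train_env, temp_seq], axis=-1),
              y_train_co2, epochs=50, batch_size=64, verbose=0)
```

Feeding the stage-1 prediction (rather than the measured temperature) into stage 2 mirrors the cascade in Figure 6, where the accuracy of the CO2 layer rests on the temperature layer.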
(d) Evaluation index illustration
Three evaluation metrics, the root mean square error (RMSE), the mean square error (MSE), and R-squared (R2), are used to evaluate the performance of the prediction models. The RMSE [60], also known as the standard error, reflects the average deviation between the predicted and real values. The MSE [61] is the mean of the squared prediction errors and is used to compare the reliability of the prediction models. R-squared (R2) [62] is a statistical measure that represents the proportion of the variance of a dependent variable that is explained by the independent variable or variables in a regression model. Whereas correlation measures the strength of the relationship between an independent and a dependent variable, R2 expresses to what extent the variance of one variable explains the variance of the other.
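For reference, the three metrics can be written as follows, with $y_i$ the measured value, $\hat{y}_i$ the predicted value, $\bar{y}$ the mean of the measurements, and $n$ the number of samples:

```latex
\begin{aligned}
\mathrm{MSE} &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \\
R^2 &= 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}.
\end{aligned}
```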
2.2.2. Prediction Model Evaluation
Each of the predictive models mentioned above has its own advantages. To compare the prediction accuracy of the models, the performance of the LSTM model without data clustering and of the SOM-BP, SOM-RBF, and SOM-LSTM models with data clustering was evaluated. The performances of the four models (SOM-BP, SOM-RBF, SOM-LSTM, and LSTM) are summarized in Table 6, Figure 7, and Figure 8. SOM-LSTM produces the most accurate results among the prediction models in this study, and SOM-BP also performs well. For the SOM-RBF model, the predictability decreased over time; for LSTM without data clustering, the deviation varied with the prediction interval.
Figure 7a,b show the comparison between the predicted results and the actual (measured) values of the four models for indoor temperature and CO2 density, respectively. SOM-LSTM was closest to the actual data, with the highest prediction accuracy. Since the SOM-LSTM model captures and uses the influence of past predicted values over time, accurate prediction remains possible even as time elapses.
For predictive models, stability under large fluctuations of the dataset is as important as accuracy. Therefore, boxplots of the accuracy results are shown in Figure 8 to analyze the stability of each prediction model.
Apparently, the SOM-LSTM model shows the best prediction performance, with an accuracy of over 95% for the prediction of indoor temperature and an acceptable accuracy of around 90% for the prediction of CO2 density. The results also demonstrate the feasibility of forecasting CO2 density by introducing indoor temperature as a second-stage input variable.
Table 6 shows the results of the four models on the three evaluation indexes: MSE, RMSE, and R2. In predicting both the indoor temperature and the CO2 density, the SOM-LSTM method outperforms the LSTM, SOM-BP, and SOM-RBF methods. Thus, the proposed model combining the LSTM algorithm with SOM clustering (SOM-LSTM) can reliably predict the indoor temperature and CO2 density from 1 May to 31 October. Further thermal comfort assessment and association analysis can be performed on this predicted dataset.