**Building Energy and Environment**

Editors **Shi-Jie Cao Wei Feng**

Basel • Beijing • Wuhan • Barcelona • Belgrade • Novi Sad • Cluj • Manchester

*Editors* Shi-Jie Cao Southeast University Nanjing, China

Wei Feng Chinese Academy of Sciences Shenzhen, China

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Topic published online in the open access journals *Energies* (ISSN 1996-1073), *Buildings* (ISSN 2075-5309), and *Designs* (ISSN 2411-9660) (available at: https://www.mdpi.com/topics/building).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

Lastname, A.A.; Lastname, B.B. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-9624-2 (Hbk) ISBN 978-3-0365-9625-9 (PDF) doi.org/10.3390/books978-3-0365-9625-9**

© 2023 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license. The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons Attribution-NonCommercial-NoDerivs (CC BY-NC-ND) license.

## **Contents**





### **Antonios A. Lithourgidis, Vasileios K. Firfiris, Sotirios D. Kalamaras, Christos A. Tzenos, Christos N. Brozos and Thomas A. Kotsopoulos**


#### **Gabriela Bastos Porsani and Carlos Fernández Bandera**


### *Article* **eSCIFI: An Energy Saving Mechanism for WLANs Based on Machine Learning**

**Guilherme Henrique Apostolo 1,2,\*, Flavia Bernardini 1, Luiz C. Schara Magalhães 2,3 and Débora C. Muchaluat-Saade 1,2**


**Abstract:** As wireless local area networks grow in size to provide access to users, power consumption becomes an important issue. Power savings in a large-scale Wi-Fi network, with low impact on user service, are undoubtedly desirable. In this work, we propose and evaluate the eSCIFI energy saving mechanism for Wireless Local Area Networks (WLANs). eSCIFI is an energy saving mechanism that uses machine learning algorithms as occupancy demand estimators. The eSCIFI mechanism is designed to cope with a broader range of WLANs, which includes Wi-Fi networks such as the Fluminense Federal University (UFF) SCIFI network. eSCIFI can cope with WLANs that cannot acquire data in real time and/or have limited CPU power. The eSCIFI design also includes two clustering algorithms, named cSCIFI and cSCIFI+, that help to guarantee the network's coverage. eSCIFI uses those network clusters and machine learning predictions as input features to an energy state decision algorithm that then decides which Access Points (APs) can be switched off during the day. To evaluate eSCIFI performance, we conducted several trace-driven simulations comparing the eSCIFI mechanism using both clustering algorithms with other energy saving mechanisms found in the literature using the UFF SCIFI network traces. The results showed that the eSCIFI mechanism using the cSCIFI+ clustering algorithm achieves the best performance and that it can save up to 64.32% of the UFF SCIFI network energy without affecting the user coverage.

**Keywords:** WLAN energy saving mechanism; machine learning; RoD strategy mechanisms; smart buildings; Wi-Fi networks

#### **1. Introduction**

The presence of Wireless Local Area Networks (WLANs) in shopping centers, convention centers, and commercial and university buildings has been increasing daily [1]. To cope with the increasing demand, new wireless Access Points (*APs*) are added to the network infrastructure in order to supply user demand with a good Internet connection [2]. However, the deployment of new *APs* raises not only the network infrastructure maintenance cost, but also its operation costs [3]. These higher operation costs are mostly caused by energy consumption [4,5].

The energy consumption of large-scale wireless networks has raised concerns among researchers [3,4,6–9]. There are several studies in the literature proposing Resource on Demand (RoD) strategies to improve the energy efficiency of those networks [7,10–14]. Wi-Fi RoD strategy management systems, or simply RoD strategy mechanisms, implement algorithms and policies that decide which *APs* should be switched off to save energy and which *APs* must stay switched on to cope with the traffic demands [1].

Some mechanisms use real-time data acquisition or sophisticated RoD strategies that demand substantial processing power to create their energy saving solutions [4,15,16].

**Citation:** Apostolo, G.H.; Bernardini, F.; Magalhães, L.C.S.; Muchaluat-Saade, D.C. eSCIFI: An Energy Saving Mechanism for WLANs Based on Machine Learning. *Energies* **2022**, *15*, 462. https://doi.org/10.3390/en15020462

Academic Editors: Shi-Jie Cao and Wei Feng

Received: 13 December 2021 Accepted: 6 January 2022 Published: 10 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).


However, there are wireless network scenarios where the Central Processing Unit (CPU) power is limited. This restriction makes real-time occupancy measurement and prediction impractical. Nevertheless, even those network scenarios could benefit from RoD strategy mechanisms, and few to no adjustments would be required.

The Fluminense Federal University (UFF) SCIFI wireless network is a large-scale network developed by UFF, initially financed by RNP (Brazilian National Research and Education Network) [17]. The SCIFI network is a low-cost solution for large-scale wireless networks. Its implementation is open source and allows the control and management of those networks. The SCIFI network has two main components: The SCIFI smart controller; and the running *APs*, operating under the open source OpenWRT firmware [18].

It is possible to apply machine learning predictions to estimate *AP* occupancy demand in the UFF SCIFI wireless network scenario [19]. The key idea is to use the occupancy demand predicted by machine learning models to control the power state of the *APs* during the day. The Wi-Fi network controller is responsible for switching off *APs* according to the estimated demand for each time slot. The Wi-Fi controller can do that by using RoD strategy mechanisms based on the demand estimated by machine learning. Machine learning models are responsible for the future occupancy demand estimations that our RoD mechanism bases its decisions on. Consequently, we need to analyze the performance of machine learning models for our scenario. Some works use the Wi-Fi infrastructure to gather information about the Wi-Fi network occupancy history and use different classification and regression machine learning models to predict network usage [3,20–24]. However, to the best of our knowledge, none of them has investigated the combination of regression and classification predictions to improve the demand estimation accuracy or the combination of two RoD strategy algorithms to ensure clients' association and the minimum network coverage for Wi-Fi networks.

This work proposes eSCIFI, an energy saving mechanism for WLANs. eSCIFI uses machine learning models to predict the future demand of the wireless network. Therefore, it can work in wireless networks where the controller's CPU power does not allow real-time data acquisition to estimate this demand. eSCIFI uses two RoD strategy algorithms to ensure clients' association and the minimum network coverage: the *AP* clustering algorithm and the double threshold algorithm. The eSCIFI mechanism can determine which *APs* should be active or turned off during certain moments of the day in order to cope with the actual network demand and also save energy.

The main contributions of this work are:


The remainder of this work is organized as follows. Section 2 presents the related work to WLAN energy saving mechanisms. Section 3 describes the eSCIFI energy saving mechanism solution proposed in our work. Section 4 covers the evaluation of the proposed eSCIFI energy saving mechanism. Finally, Section 5 concludes this work, pointing out some enhancements and applications that might be explored in future work.

#### **2. Related Work**

Based on the work of Budzisz et al. [25], Jardosh et al. [2] and Lorincz et al. [26], we developed an extended taxonomy that helped us to compare distinct RoD strategy mechanisms for WLANs. Our taxonomy consists of seven non-overlapping categories, corresponding to the main characteristics of related work: (1) network type, (2) WLAN application scenario, (3) control scheme, (4) operation strategy, (5) metrics, (6) algorithm type, and (7) evaluation method.

Most related work has developed RoD strategy mechanisms for Wi-Fi (IEEE 802.11) networks. However, there are valuable contributions in the literature that developed RoD strategy mechanisms for mesh [27] and cellular networks [8,24,28,29]. Those wireless network types have distinct characteristics, but the strategies and algorithms used in their RoD strategy mechanisms are interchangeable and sometimes even overlapping. It is important to notice that an RoD strategy mechanism developed and tested for a specific wireless network can be used in other wireless network types. Therefore, the network type category does not imply any limitation on the applicability of the RoD strategy mechanism, but only describes the type of network used as the motivation and experimental scenario.

Most of the RoD strategy mechanisms were developed for application scenarios where they depend on homogeneous WLANs to operate, such as [2,7,11,12,14]. In those cases, the RoD strategy mechanism is implemented to fully cope with the WLAN technology without depending on any other wireless networks that might work in that area to help implement their energy saving strategies. However, there are some RoD strategy mechanisms that were designed to operate in heterogeneous WLAN scenarios, such as [9,30,31]. In the heterogeneous WLAN application scenarios, the WLAN can rely on other wireless technologies, such as Bluetooth, or on a separate wake-up radio transceiver to detect user activity while the WLAN infrastructure is turned off. The RoD strategy mechanisms developed for heterogeneous WLAN application scenarios can usually achieve higher energy saving rates without affecting the user Quality of Service (QoS), since there is always a supportive wireless network to detect new users instantly. However, homogeneous networks are less complex in terms of deployment, control and management, due to their independent WLAN nature.

The control scheme category expresses how the RoD strategy mechanism implements its energy saving strategy. The control scheme can be centralized or distributed. RoD strategy mechanisms with a centralized control scheme use a central controller to supervise the network and send the commands to *APs*. Centralized control schemes are more common for large wireless networks, since most of them already have a central controller and their *APs* usually are not powerful enough to implement the algorithms and calculations needed. However, the centralized control scheme can be subdivided into two categories depending on whether the central controller is designed for a Software Defined Network (SDN) or not.

SDNs separate the control and data planes by introducing a centralized controller that is responsible for resolving flow forwarding policies and assigning them to the switches' forwarding tables [10]. Some related work [10,14–16] developed energy saving mechanisms for SDN-based networks with a centralized SDN controller. The use of SDN controllers allows those energy saving mechanisms to easily use collected network information, such as network topology and traffic usage. However, not every large-scale WLAN controller is based on the SDN paradigm, and therefore not every one can count on all its advantages.

There are some proposed energy saving mechanisms in our related work that do not consider the controller to be SDN-based [3,8,27,32]. Those energy saving mechanisms also work with a centralized control scheme, but with non-SDN controllers, which makes them a feasible solution for WLANs where not all SDN advantages are present. On the other hand, in a distributed network, the WLAN elements are all responsible for controlling their energy state and deciding whether they can be turned off or not. However, it is important to highlight that a distributed control scheme does not necessarily mean that each WLAN *AP* works independently of the others. In [33,34], the Wi-Fi *APs* implement an energy saving strategy without a central controller, but they use out-of-band communication between them to decide which *APs* can be turned off.

RoD strategy mechanisms can be classified into two operation strategies: demand-driven or schedule-driven. Demand-driven strategies collect real-time information from the WLAN resources to estimate user demand [2]. The advantage of these strategies is that they can save energy in the WLAN while satisfying the user demand. However, demand-driven strategies have a higher CPU power cost due to the overhead of assessing user demands continuously [6]. Demand-driven strategies are more suitable in scenarios where the user demand may vary unpredictably over time, such as in stadiums [7]. On the other hand, schedule-driven strategies use predefined schedules to produce their energy savings. These schedules can be obtained with machine learning models trained with WLAN historical usage data [3,12,29] or can be based on the administrator's experience [14]. The advantage of using schedule-driven strategies is their low CPU power requirements. Schedule-driven strategies are only suitable for scenarios where user demand is predictable, such as university networks [3,12,14].

The RoD strategy mechanisms can be divided into four subsets according to the metrics they use to minimize energy consumption. The most common and intuitive is the traffic metrics subset, which comprises any network-traffic-related metric, such as the number of associated users [3,12,32], throughput [8], or more sophisticated ones such as channel utilization [2]. Traffic metrics are routinely measured in a network and are therefore easily accessible, but alone they might not be enough to guarantee QoS or coverage. Coverage metrics are used to ensure that the whole radio network area [2,6,14] and all users [7,15,16] will be covered. Coverage implies that the RoD energy saving strategies guarantee that all users can connect to at least one active radio. QoS metrics are often used in studies that try to minimize the impact on the user's service [13,28], but they also imply smaller savings or more complex algorithms. Energy metrics [4,26] consider the quantitative energy reduction when analyzing switching on/off strategies. A clear implication is that the user's traffic or QoS constraints might not be met. Every metric alone has its advantages and weaknesses; therefore, most related work uses a combination of metrics to guarantee that the user's demand will be met.

RoD strategy mechanisms can also be divided by the type of algorithm used for making the energy status decision for the WLAN resources based on the available metrics. Heuristic algorithms can rapidly determine a solution within reasonable time using reasonable resources [26]. As the name suggests, heuristic algorithms are based on heuristic solutions that are easier to implement and usually rely on thresholds [6,12,29] or other metric combination rules [30]. Heuristic algorithms are usually most suitable for WLAN scenarios where the available CPU power and/or computational time is limited. On the other hand, optimization algorithms are based on different mathematical problems and solvers that guarantee the best possible solution to a specified problem [4]. Optimization algorithms require more time and resources to provide their solution and therefore are suitable for WLAN scenarios where ample CPU power and/or computational time is available. Related work shows that optimization algorithms achieve better results when compared to heuristic ones [15,16]. However, Lorincz et al. [26] concluded in their work that "heuristic algorithms can be valuable alternatives offering good solution in reasonable amount of time".

Lastly, related work can be divided according to the experimental tests used to evaluate the performance of the RoD strategy mechanism. Simulation tests are those that make use of simulation software such as Matlab [35], Scenargie [36] or NS-3 [14] to recreate their WLAN scenarios and evaluate performance. Trace-driven tests are those that use network traces to reproduce a real network scenario, comparing how the network would respond to the changes in that scenario under distinct energy saving mechanisms [3,12,24,29]. Testbed experiments are those where a real WLAN infrastructure is used, but a limited set of users and their behavior are simulated [1,7,37]. Some related work refers to its tests as real network scenario tests [2,7]. However, those tests do not analyze the real infrastructure in a regular usage scenario with undefined users or behaviors, and therefore we classified them as testbed experiments.

Table 1 compares related work to our proposed eSCIFI mechanism. It is important to highlight that eSCIFI can be used in a wider range of WLANs than most of the mechanisms presented in related work that have a centralized control scheme, since eSCIFI can cope with wireless networks that are not SDN-based, do not have a high-CPU-power controller, and cannot collect data in real time. Those characteristics make any energy saving mechanism that relies on optimization algorithms or demand-driven strategies impractical. However, it is worth mentioning that, in WLAN scenarios that do not have those restrictions, eSCIFI can work normally, but it might not be the best practical solution, since it might not take advantage of the additional capabilities available.

The eSCIFI characteristics make it a feasible solution for our motivation and evaluation scenario, since it allows the development of an energy saving mechanism that can cope with the characteristics of the UFF SCIFI pure Wi-Fi network. eSCIFI presents a centralized control scheme and a schedule-driven operation strategy based on machine learning, using heuristic algorithms and traffic and coverage metrics.

Therefore, we can summarize the eSCIFI key contributions and advantages as:


**Table 1.** RoD strategy mechanism related work comparison.


#### **3. Proposed eSCIFI Mechanism**

In our previous work [39], we created a dataset using real user data collected from a subset of *APs* of the UFF SCIFI network located in a specific building of the engineering campus (the H building). That dataset provides the occupancy estimations for the H building during a period of 6 months (from April 2018 to September 2018). These features make our dataset one of the biggest and most recent [3,12,32] and, to the best of our knowledge, the only publicly available one. From the occupancy analysis [39], it was possible to observe that most network *APs* at the H building are switched on despite being idle. That active idleness causes an unnecessary waste of energy. Therefore, an energy saving WLAN mechanism based on RoD strategies, or simply an RoD strategy mechanism, that effectively controls WLAN resources can help to prevent that energy waste while coping with the user demand.

This work proposes the eSCIFI energy saving mechanism for WLANs. eSCIFI uses machine learning prediction models and other RoD strategies to create an energy saving mechanism. eSCIFI can also work with non-SDN large wireless networks and/or large wireless networks where real-time data acquisition is not possible. Those possibilities make the eSCIFI a feasible solution for a greater number of wireless networks in use, especially university networks, such as the UFF SCIFI network, which was used for evaluating our proposal.

#### *3.1. eSCIFI Mechanism Overview*

Figure 1 shows the main architectural components of eSCIFI and its major steps, which are (i) the unified methodology; (ii) the hybrid model; and (iii) the heuristic algorithm.

The first step, shown in the left upper part of the figure, is to use our unified methodology to create the datasets and select the best regression and classification model configuration parameters. Later on, in the hybrid model, we combine the best trained regression and classification models selected in our unified methodology to give the future *AP* occupancy estimation. That occupancy estimation is used by our heuristic algorithm to define which *APs* should be turned on or off.

In the heuristic mechanism, we first extract the *AP* statistics from the dataset. Later on, the heuristic network cluster formation uses the *AP* neighborhood list and the *AP* statistics to create the network clusters that can guarantee a minimum network coverage. Finally, the energy state decision algorithm uses the defined network clusters and the *AP* occupancy estimation to decide which *APs* should be switched on/off to cope with the user demand. At the end of this process, our heuristic mechanism provides an energy schedule for all *APs* in the network for an entire day that can guarantee a minimum coverage to the network while coping with the user demand. That way, the eSCIFI mechanism needs to run only once a day to generate the energy schedule of all *APs* in the network. Therefore, the eSCIFI mechanism can run at any moment of low activity in the network, such as the late night hours after midnight in our case. This scheme guarantees that eSCIFI can run on any network controller without burdening its processing capacity.

**Figure 1.** eSCIFI architecture.

#### *3.2. Unified Methodology and Model Selection*

The unified methodology proposed in [39] explains how the occupancy count (the number of devices connected to an *AP*) and occupancy detection (whether the *AP* is occupied or not) datasets were created. In summary, we processed *AP* event logs to filter information about the association status between mobile stations and *APs*. Each day was divided into 144 time slots (10 min each), and for each time slot the number of associated devices was computed. This was done for all the *APs* involved. The datasets show the occupancy count and detection of 28 *APs* in a classroom building at UFF's Engineering Campus over a period of 6 months, from April to September 2018. Those datasets were crucial to extract the *AP* statistics that were necessary for the network cluster formation. They were also crucial to the model selection process.
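The slotting step can be illustrated with a minimal sketch. It assumes a simplified record layout, `(ap_id, device_id, timestamp)`, where each record means that the device was seen associated with the *AP* at that time. The function names and the record layout are hypothetical and are not taken from [39], whose event-log processing is more involved.

```python
from collections import defaultdict
from datetime import datetime

SLOT_MINUTES = 10                          # slot length stated in the text
SLOTS_PER_DAY = 24 * 60 // SLOT_MINUTES    # 144 slots per day


def slot_index(ts: datetime) -> int:
    """Map a timestamp to its 10-minute slot within the day (0..143)."""
    return (ts.hour * 60 + ts.minute) // SLOT_MINUTES


def occupancy_tables(records):
    """Build occupancy count and detection tables.

    records: iterable of (ap_id, device_id, timestamp) tuples
             (hypothetical layout, assumed for illustration).
    Returns two dicts keyed by (ap_id, date, slot). Slots without any
    record are simply absent and can be read as unoccupied.
    """
    devices = defaultdict(set)
    for ap_id, device_id, ts in records:
        devices[(ap_id, ts.date(), slot_index(ts))].add(device_id)
    # Occupancy count: distinct devices seen in the slot.
    count = {key: len(devs) for key, devs in devices.items()}
    # Occupancy detection: 1 if at least one device was seen, else 0.
    detection = {key: int(n > 0) for key, n in count.items()}
    return count, detection
```

Counting distinct devices per slot, rather than raw records, avoids inflating the count when the same device is logged several times within the same 10 min window.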

The model selection process in our unified methodology compares several model configurations and hyperparameters in order to determine the best classification and regression models for our evaluation scenario. The evaluation involved the use of multiple classification (for occupancy detection) and regression (for occupancy count) models using a variety of configurations and algorithms. The following algorithms were used as classification models: Decision Tree (DT); K-NN; Random Forest (RF); and MultiLayer Perceptron neural network. As regression models, we used DT, K-NN, RF, XG optimized gradient boosting, support vector machine (SVM), stochastic gradient descent (SGD) algorithms and MultiLayer Perceptron neural network.

To evaluate these models, we applied a train/test split on our data where the dataset's association data from April to August were used for training, and the dataset's association data for the month of September were used for testing. We used several metrics to evaluate the performance of our classification (such as accuracy and F1 score) and regression (such as root mean square error and mean absolute percentage error) models. From the results shown in [39], it was possible to select the best classification and regression model for our scenario.
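A minimal sketch of the month-based split and of two of the regression metrics is given below. The row layout (a dict with a `month` key) and the function names are assumptions made for illustration. Skipping samples whose actual value is zero when computing the percentage error is also an assumption, since the text does not say how [39] handles empty time slots.

```python
import math


def split_by_month(rows, test_months=(9,)):
    """Split rows into train/test sets by calendar month.

    rows: iterable of dicts with at least a 'month' key (hypothetical layout).
    With the default, September is the test set and the rest is training.
    """
    rows = list(rows)
    train = [r for r in rows if r["month"] not in test_months]
    test = [r for r in rows if r["month"] in test_months]
    return train, test


def rmse(actual, predicted):
    """Root mean square error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Samples with actual == 0 are skipped (assumption), because the
    percentage error is undefined for them.
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)
```

A chronological split like this one keeps all test samples later in time than the training samples, which matches the way the mechanism is meant to be used: trained on past usage and applied to future days.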

Results showed that the best classification and regression models were a single-label regressor and classifier trained using the decision tree algorithm in a collective manner where only one classifier and regressor were trained to predict the occupancy of all *APs* based on the previous data of all *APs*. The output feature is the occupancy estimation for the specific time slot. The models used the following attributes: Month, Day, Day of the Week, Holiday, Access Point Id, Hour, and Minute.

Therefore, only one classifier and one regressor are needed for our scenario. Those classification and regression models used the decision tree machine learning algorithm and three input features (*AP* identification, day of the week and holiday). *AP* identification (APid) carries the access point identification number, day of the week indicates the respective weekday, and holiday indicates whether the day is a day with lectures or not. Those are the machine learning classification and regression models selected, and they will be used in the hybrid model to provide future usage predictions for the H building UFF SCIFI Wi-Fi network in our evaluation.

#### *3.3. Hybrid Model*

The results in [39] showed that even the best regression model has significant values of the Root Mean Squared Percentage Error for a specific time slot $t_j$ ($RMSPE_{t_j}$) during night and morning time slots, but the $RMSPE_{t_j}$ values for time slots after midday decrease. On the other hand, the best classification model has relatively higher values of the accuracy for a specific time slot $t_j$ ($A_{t_j}$) for night and morning time slots than for the rest of the day. Therefore, we propose a hybrid model. The hybrid model combines the results given by the classification model with the results given by the regression model in order to create a better occupancy count estimation. Considering *CMR* as the classification results matrix that shows the occupancy detection estimations provided by the classifier for the *APs* and *RMR* as the regression results matrix that shows the occupancy count estimations provided by the regressor, we define the hybrid model estimation *HMR* as the Hadamard product of the *CMR* and *RMR* matrices. Equation (1) shows the Hadamard product that produces the hybrid model results matrix that is used as the demand estimation by our mechanism.

$$HMR = CMR \circ RMR \tag{1}$$
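As a small worked example of Equation (1), the sketch below multiplies the two matrices element by element. It assumes that the classification entries are 0 (unoccupied) or 1 (occupied) and that both matrices share the same layout, for instance one row per *AP* and one column per time slot. That layout and the numbers are illustrative only.

```python
def hadamard(cmr, rmr):
    """Element-wise (Hadamard) product of two equally shaped matrices."""
    if len(cmr) != len(rmr) or any(len(c) != len(r) for c, r in zip(cmr, rmr)):
        raise ValueError("CMR and RMR must have the same shape")
    return [[c * r for c, r in zip(c_row, r_row)]
            for c_row, r_row in zip(cmr, rmr)]


# Illustrative values: 2 APs (rows) x 3 time slots (columns).
cmr = [[0, 1, 1],
       [0, 0, 1]]            # occupancy detection (0 = empty, 1 = occupied)
rmr = [[1.4, 6.0, 9.2],
       [0.8, 2.1, 3.0]]      # occupancy count estimated by the regressor

hmr = hadamard(cmr, rmr)     # the count is kept only where the classifier says "occupied"
```

In this example, the regressor's small non-zero estimates in the slots that the classifier marks as empty are zeroed out, which is the effect the hybrid model relies on to correct over-predicted demand.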

Figure 2 shows how the hybrid model demand prediction results are closer to the real demand than the regression model demand predictions for the month of September 2018. In fact, Figure 2 shows that the hybrid model can reduce the demand over-prediction that happened on the weekends (September 1, 2, 8, 9, 15, 16, 22, 23, 29, 30) and on Brazil's Independence Day public holiday (September 7).

**Figure 2.** Hybrid model results compared with the real demand and the demand given by the regression results for the whole month of September.

The hybrid model only uses the APid, day of the week and holiday attributes as input features. Consequently, there are only 14 possible demand estimations for a specific *AP* (one for each regular day of the week and one for each holiday on these days). Therefore, we decided to compare the results of our hybrid model with a mean estimator. The occupancy count prediction provided by the mean estimator for a specific set of input features (APid, day of the week and holiday) is the average occupancy count of that specific set of input features in the association history. Table 2 shows that the hybrid model had better Root Mean Squared Error (*RMSE*), overall Root Mean Squared Percentage Error (*RMSPE*) and overall Mean Absolute Percentage Error (*MAPE*) results when compared to the mean estimator model. Those better results can be explained by the fact that the hybrid model reduced the prediction errors that happened on weekends and on public holidays when compared to the mean estimation results. Those error reductions on weekends and public holidays were more significant than the errors introduced in night time slots by the hybrid model, and therefore the overall *RMSE*, *RMSPE* and *MAPE* results were better.
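The mean estimator described above can be sketched as a lookup table of averages. The tuple layout of the history and the function name are hypothetical. The key uses exactly the three input features named in the text.

```python
from collections import defaultdict


def fit_mean_estimator(history):
    """Average occupancy count per (ap_id, day_of_week, holiday).

    history: iterable of (ap_id, day_of_week, holiday, occupancy_count)
             tuples (hypothetical layout, assumed for illustration).
    Returns a dict mapping each feature combination to its mean count.
    """
    totals = defaultdict(lambda: [0.0, 0])   # key -> [sum, number of samples]
    for ap_id, day_of_week, holiday, count in history:
        acc = totals[(ap_id, day_of_week, holiday)]
        acc[0] += count
        acc[1] += 1
    return {key: total / n for key, (total, n) in totals.items()}


# Illustrative history: Monday = 0, Saturday = 5.
history = [
    ("ap1", 0, False, 10),
    ("ap1", 0, False, 20),
    ("ap1", 5, False, 0),
    ("ap1", 0, True, 2),
]
means = fit_mean_estimator(history)
```

With 7 days of the week and 2 holiday values, each *AP* has at most 14 keys in the resulting table, which matches the 14 possible estimations mentioned in the text.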

It is important to highlight that the differences between the results shown in Table 2 are not significant enough to prove that the hybrid model is a better regression prediction model than the pure regression model or the mean estimator for all scenarios. The mean estimator results in our case scenario are very close to those achieved by the hybrid model. However, those results achieved by the mean estimator for our case scenario were only possible due to the H building occupancy characteristics. The H building has only classrooms, so its occupation mainly occurs through lectures and exams. The lecture schedule did not change drastically throughout the entire dataset, which makes the occupancy behavior periodic and well behaved in our case. This behavior might not be common for other buildings in the university that have other room types, such as professors' offices or laboratories, or even for other scenarios such as parks or malls. In other scenarios similar to ours, the mean estimator can be a viable option due to its simplicity. The use of the mean estimator does not impose any change to the eSCIFI operation. However, we decided to use the hybrid model since it has shown better results in our case scenario, specifically on weekends and holidays.


**Table 2.** Mean Estimator and Hybrid models performance results.

#### *3.4. Heuristic Mechanism*

The heuristic mechanism is responsible for providing the energy state (on or off) schedule of the SCIFI *APs* for a given date. It is important to highlight that we only control the energy state of the *APs*' wireless interface, because the existing UFF SCIFI infrastructure only allows us to control that interface. However, in WLANs where the *APs* are connected to Power over Ethernet (PoE) switches, eSCIFI could control the energy state of the whole *AP* and not only its wireless interface.

Our heuristic mechanism has two main components: the heuristic cluster formation algorithm and the energy state decision algorithm. The clustering algorithm creates the *AP* clusters based on their neighborhood in order to guarantee the network coverage area to the clients. The energy state decision algorithm provides the energy state of all *APs* for a specific time slot and date based on the machine learning occupancy predictions and clusters. In the following sections, we detail the heuristic cluster formation algorithm and the energy state decision algorithm and their challenges.

#### 3.4.1. Heuristic Cluster Formation: cSCIFI and cSCIFI+

Jardosh et al. [2] proposed a clustering algorithm called green clustering. The idea behind the green clustering algorithm is to create clusters of *APs* that are in proximity of each other. Several *APs* in a large wireless network have overlapping coverage areas in order to cope with higher user demand. Those *APs* are in a spatially neighboring condition that allows one of them to provide coverage to the users of all *APs* in its vicinity. Therefore, it is possible to create clusters of neighboring *APs* where any user within the cluster coverage is able to connect to the network as long as at least one *AP* in the cluster is turned on. We propose two heuristic cluster formation algorithms, cSCIFI (cluster SCIFI) and cSCIFI+ (cluster SCIFI+). Those clustering algorithms are based on the green clustering algorithm of Jardosh et al. [2]. However, we introduced some basic changes to improve the cSCIFI and cSCIFI+ cluster formation process.

Our clustering algorithms need two input features to work: the neighborhood list and the *AP* statistics. To create a neighborhood list, we need to define the vicinity criterion. Only *APs* that are considered neighbors can belong to the same cluster. Jardosh et al. [2,6] used the spatial distance between *APs*, the median number of beacon messages and the median signal strength of the beacons as vicinity criteria. In our cSCIFI and cSCIFI+ algorithms, we use the *APs*' signal quality scan to define our vicinity criterion. The SCIFI network periodically runs a signal quality scan that reports the signal quality values that a certain *AP* has received from the other *APs* it has scanned. The signal quality is a measurement that goes from 0 to 100 and takes into consideration the Received Signal Strength Indication (RSSI) and other network parameters. We considered *APs* with a measured signal quality above 50 to be neighbors. Therefore, the neighbors of an *AP* are: (i) the *APs* on the same side of the building and floor; (ii) the *APs* that are in rooms directly above and below (e.g., neighbors to the *AP* in room 303 are the *APs* in rooms 403 and 203). With the established vicinity criterion, we can determine which *APs* are neighbors and create a neighborhood set list for each *AP*.
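The vicinity criterion above can be illustrated with a minimal sketch. The function name `build_neighborhoods`, the shape of the `scan` dictionary and the AP identifiers are hypothetical; the only value taken from the text is the signal quality threshold of 50. The sketch also assumes the neighbor relation is symmetric, which the text does not state explicitly.

```python
from collections import defaultdict

QUALITY_THRESHOLD = 50  # vicinity criterion stated in the text


def build_neighborhoods(scan, threshold=QUALITY_THRESHOLD):
    """Return {ap: set of neighbor APs} from signal quality readings.

    `scan` maps (scanning_ap, heard_ap) -> signal quality in [0, 100].
    Assumption: one reading above the threshold, in either direction,
    makes both APs neighbors of each other.
    """
    neighbors = defaultdict(set)
    for (scanner, heard), quality in scan.items():
        # Touch both keys so every AP seen in the scan gets an entry.
        neighbors[scanner]
        neighbors[heard]
        if scanner != heard and quality > threshold:
            neighbors[scanner].add(heard)
            neighbors[heard].add(scanner)
    return dict(neighbors)


# Hypothetical readings for the AP in room 303.
scan = {
    ("ap303", "ap403"): 72,
    ("ap303", "ap203"): 64,
    ("ap303", "ap310"): 35,  # below the threshold: not a neighbor
    ("ap403", "ap303"): 70,
}
neighborhoods = build_neighborhoods(scan)
```

With these readings, `ap303` ends up with the neighborhood set `{ap403, ap203}`, matching the room 303 example, while `ap310` has an empty set.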

We describe our cluster formation algorithms as follows. Consider *Vi* as the neighborhood set of *APi*, *C* as our cluster set, and *Ci* as the cluster formed starting from *APi*. In our cSCIFI clustering algorithm, we first select the starting access point *APs* with the biggest neighborhood set, form a new cluster *Cs* and add *APs* to this newly formed cluster. When *APs* is added to its cluster *Cs*, the cSCIFI algorithm also removes *APs* from all other *AP* neighborhood sets and updates their number of neighbors. Then, the algorithm steps through all the access points in the neighborhood set *Vs* and adds the *APh* that has the biggest neighborhood set, as long as every new *APh* added to *Cs* is in the neighborhood set of all other access points already included in *Cs*. We call this the neighboring condition. As long as the neighborhood set *Vs* contains access points that satisfy the neighboring condition, those access points are added to cluster *Cs* and removed from the other *AP* neighborhood sets.

When there are no more *APs* in the *APs* neighborhood set or there are no more *APs* that satisfy the neighboring condition, the algorithm moves to the next *AP* with the biggest neighbor set and continues the cluster formation until there are no more *APs* left and the cluster set *C* is finished.
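The greedy procedure described above can be sketched as follows. This is an illustrative reading, not the authors' code: the function name `cscifi_clusters` and the example topology are hypothetical, and ties are simply broken by AP name here rather than by the tie-handling rules discussed later in this section.

```python
def cscifi_clusters(neighborhoods):
    """Greedy cSCIFI-style clustering sketch.

    `neighborhoods` maps each AP to the set of its neighbors.
    Returns a list of clusters, each starting with the AP that initiated it.
    """
    original = {ap: set(v) for ap, v in neighborhoods.items()}  # for the neighboring condition
    working = {ap: set(v) for ap, v in neighborhoods.items()}   # shrinks as APs are clustered
    remaining = set(working)
    clusters = []

    def claim(ap):
        # Remove a clustered AP from every other neighborhood set.
        remaining.discard(ap)
        for other in working.values():
            other.discard(ap)

    def biggest(aps):
        # Largest current neighborhood set; AP name breaks ties (assumption).
        return min(aps, key=lambda ap: (-len(working[ap]), ap))

    while remaining:
        start = biggest(remaining)
        cluster = [start]
        claim(start)
        while True:
            # Neighboring condition: a candidate must neighbor every member.
            eligible = [
                ap for ap in original[start] & remaining
                if all(ap in original[member] for member in cluster)
            ]
            if not eligible:
                break
            chosen = biggest(eligible)
            cluster.append(chosen)
            claim(chosen)
        clusters.append(cluster)
    return clusters


# Hypothetical topology: A, B and C all hear each other; D only hears A; E is isolated.
example = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
    "E": set(),
}
clusters = cscifi_clusters(example)
```

In this example D is left in a cluster of its own, because it neighbors A but not B or C and therefore fails the neighboring condition.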

Algorithm 1 shows the cSCIFI cluster function code, where we can see that every *AP* will only be in one cluster and that every *AP* is in the vicinity of all other *APs* inside its cluster. The neighboring condition (line 5) allows any user in the cluster coverage area to connect to any of the powered-on *APs*, since they are all each other's neighbors.

#### **Algorithm 1** *cSCIFI*

1: **function** Create\_Cluster\_cSCIFI (Cluster\_Head, Cluster\_head\_list\_of\_neighbors):

2: Cluster\_auxiliary\_list = [Cluster\_Head]

3: sort *APs* in Cluster\_head\_list\_of\_neighbors according with the number of neighbors in their neighborhood list


cSCIFI+ is simpler and more aggressive than cSCIFI. The cSCIFI+ clustering algorithm works like cSCIFI, but the *APs* added to a certain cluster *Ci* no longer need to satisfy the neighboring condition. In cSCIFI+, all *APs* in the neighborhood set *VA* of *APA* are added to cluster *CA*.

cSCIFI+ guarantees that the size of cluster set *C* will be the smallest possible. However, users from a switched-off *AP* in the cluster can only connect to *APA* that initiated that cluster. Considering the clusters formed with cSCIFI, users from any *AP* can connect to other *APs* in that cluster, which might balance the load between the switched-on *APs*. Algorithm 2 shows the cSCIFI+ cluster function code, where we can see that only the *AP* that initiated the cluster formation can assure connection to all users from switched-off *APs*, which may sometimes cause congestion.

#### **Algorithm 2** *cSCIFI+*

1: **function** Create\_Cluster\_cSCIFI+ (Cluster\_Head, Cluster\_head\_list\_of\_neighbors):
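Independently of the listing above, the cSCIFI+ behavior described in the text can be sketched in a few lines. The function name `cscifi_plus_clusters` and the example topology are hypothetical, and ties are broken by AP name as a simplifying assumption.

```python
def cscifi_plus_clusters(neighborhoods):
    """cSCIFI+-style clustering sketch: no neighboring condition.

    Every still-unclustered neighbor of the initiating AP joins its cluster.
    Returns a list of clusters, each starting with its initiating AP.
    """
    working = {ap: set(v) for ap, v in neighborhoods.items()}
    remaining = set(working)
    clusters = []
    while remaining:
        # Initiating AP: biggest current neighborhood set, name breaks ties.
        head = min(remaining, key=lambda ap: (-len(working[ap]), ap))
        members = sorted(working[head] & remaining)
        cluster = [head] + members
        remaining.difference_update(cluster)
        for other in working.values():
            other.difference_update(cluster)
        clusters.append(cluster)
    return clusters


# Hypothetical topology: A, B and C all hear each other; D only hears A; E is isolated.
example = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
    "E": set(),
}
plus_clusters = cscifi_plus_clusters(example)
```

Here D joins the cluster started by A even though it does not neighbor B or C, so only A can serve D's users when D is switched off.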


The cSCIFI and cSCIFI+ greedy algorithms alone cannot guarantee that the best cluster set is formed in cases where there is a tie between *APs*. A solution would be to create all cluster possibilities, choosing each one of the tied *APs* as the first choice. After creating all possible sets *C*, we would select the one that has the minimum number of clusters. Creating those multiple cluster sets can cause an exponential growth in the execution time. To minimize this problem, we simplified the cSCIFI and cSCIFI+ selection in cases of ties. cSCIFI and cSCIFI+ will only create multiple cluster sets when there are ties between *APs* that will be selected to initiate a cluster formation. This selection criterion guarantees that only different cluster initiations are taken into account, and not all possible internal cluster formations, which reduces the size of the possible solution set.

In the cSCIFI algorithm, we also added another selection criterion for cases where there are ties between *APs* to be added to cluster *Ci* where an *APi* has already initiated it. In those cases, *APj* with the highest number of neighbors in the *APi* neighborhood set *Vi* is selected. *APs* with the same number of neighbors in their sets can generate different clusters, since some of their neighbors might not be in *APi* neighborhood set *Vi*. Therefore, in cases of ties, it is the best option to select *APj* that has the biggest number of matching neighbors to the *APs* in the neighborhood set *Vi*. This change in the internal cluster formation process guarantees that the next *APs* to be added will be the ones that will contribute to a bigger cluster size.

The characteristics cited previously minimize the execution time and guarantee that the selected cluster set *C* does not depend on the order in which *APs* appear in the neighborhood list. This is an important advantage of our clustering algorithms when compared to the green clustering algorithm proposed by Jardosh et al. [2], since we do not need to worry about the *AP* order of appearance in the neighborhood list construction process.

The last characteristic of our clustering algorithms is the cluster head election. The cluster head is the *AP* that will always be switched on and will be responsible for guaranteeing the cluster coverage area. In cSCIFI+, the cluster head will always be the one that initiated the cluster formation. This *AP* is the only *AP* that can be the cluster head, since it is the only *AP* that has a guaranteed neighboring condition to all other *APs* in the cluster. On the other hand, the election of the cluster head in the cSCIFI algorithm can be more sophisticated, since all *APs* in the clusters obey the neighboring condition. In clusters formed with cSCIFI, the cluster head will change throughout the day using the statistics of the *APs*. In those clusters, the average association of each *AP* is calculated for the night (0 a.m.–7 a.m.), morning (7 a.m.–1 p.m.) and afternoon/evening (1 p.m.–11:59 p.m.) periods. The *AP* with the highest night average association will be selected to be the cluster head for the night period and so on.
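The per-period cluster head election for cSCIFI clusters can be sketched as follows. The function name `elect_cluster_heads`, the hourly granularity of the statistics and the example numbers are assumptions made for illustration; only the three period boundaries come from the text.

```python
from statistics import mean

# Period boundaries from the text, expressed as hours of the day.
PERIODS = {
    "night": range(0, 7),               # 0 a.m. to 7 a.m.
    "morning": range(7, 13),            # 7 a.m. to 1 p.m.
    "afternoon_evening": range(13, 24)  # 1 p.m. to 11:59 p.m.
}


def elect_cluster_heads(hourly_associations):
    """Pick, for each period, the AP with the highest average association.

    `hourly_associations` maps each AP of ONE cluster to a list of 24
    hourly association counts (hypothetical data layout).
    """
    heads = {}
    for period, hours in PERIODS.items():
        averages = {
            ap: mean([counts[h] for h in hours])
            for ap, counts in hourly_associations.items()
        }
        # Highest average wins; AP name breaks ties (assumption).
        heads[period] = min(averages, key=lambda ap: (-averages[ap], ap))
    return heads


# Hypothetical cluster with two APs.
stats = {
    "ap_a": [5] * 7 + [40] * 6 + [10] * 11,
    "ap_b": [1] * 7 + [20] * 6 + [30] * 11,
}
heads = elect_cluster_heads(stats)
```

In this example `ap_a` leads the cluster at night and in the morning, and `ap_b` takes over for the afternoon/evening period.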

#### 3.4.2. Energy State Decision Algorithm

The energy state decision algorithm is responsible for providing the energy scheduling of all *APs* for a date. It runs once a day and uses the traffic demand estimated by the hybrid machine learning model (the user association number in our SCIFI network scenario) to calculate the cluster demand for specific moments of a day and then decide which *APs* in a cluster can be switched off. The energy state decision algorithm is the last step in the eSCIFI energy saving mechanism, and it is responsible for actively deciding which *APs* will be switched on or off and for providing the energy scheduling to the SCIFI controller. The SCIFI controller, based on this energy scheduling, will then switch the *AP* wireless interfaces on and off for the specified periods.

Our energy state decision algorithm uses the RoD policy proposed in the work of Dalmasso et al. [8]. However, our energy state decision algorithm works using machine learning occupancy estimations instead of real traffic data and therefore presents some modifications in the RoD policy design. This RoD policy has two main components: the time window and the double threshold criteria. The time window defines how long it will take before the algorithm reconfigures the *AP*'s energy state. The time window size *tw* determines how frequently the network will be reconfigured and also the demand estimation resolution. A small time window allows the energy state decision algorithm to perceive short bursts in the traffic demand. On the other hand, a large time window only perceives the average traffic, where instant or momentary bursts in the traffic demand fade. At first, a smaller time window size seems always the best choice. However, a smaller time window size means that more rounds of energy state decisions have to be made by the algorithm and that the controller has to reconfigure the network more frequently. Related work [3,8,12,14,24] states that small time window sizes are not necessary. In fact, depending on the network traffic profile, those changes in the traffic demand can take hours to happen. Therefore, the time window size is a parameter that needs to be decided based on the network scenario. In Section 4.1, we discuss the selection of the time window size in depth.

The main concept behind our energy saving strategy is moving the traffic demand from switched-off *APs* to the cluster head *AP* or other switched-on *APs* in the cluster that can handle it. In the work of Dalmasso et al. [8], the *APs* in a cluster can be switched off based on the actual traffic demand (real time traffic data) at the beginning of each time window. However, the eSCIFI mechanism uses machine learning models to estimate demand. Therefore, in our energy decision algorithm, the decisions made for each time window take into consideration the demand estimated for its whole duration and not just the demand at the beginning of the time window.

All *APs* in the network have the same maximum user threshold *Tmax* for a time window. This maximum user threshold *Tmax* defines how much traffic (or how many associations in our case) the *APs* can handle for the duration of the time window. The cluster head of every cluster will always be switched on, guaranteeing a traffic capacity of *Tmax* for the cluster. In our energy state decision algorithm, the double threshold criteria define which *APs* in a cluster can be switched off based on the traffic demand estimated by the machine learning hybrid model for the assessed time window. However, this energy state decision algorithm varies depending on whether the cSCIFI or the cSCIFI+ algorithm is used.

In the cSCIFI algorithm, all *APs* in a cluster are neighbors of each other. Therefore, in the cSCIFI case, the double threshold criteria define that *APs* with estimated traffic demand below a minimum threshold *Tmin* for the whole time window are switched off as long as the available traffic capacity provided by all *APs* that are switched on can handle their estimated traffic. Considering *DMi* as the traffic demand of *APi* for a time window, *d* as the number of switched-on *APs*, *o* as the number of switched-off *APs*, $\sum_{a=1}^{d} DM_a$ as the traffic demand of all *d* switched-on *APs* and $\sum_{b=1}^{o} DM_b$ as the traffic demand of all *o* switched-off *APs*, we can define how our energy state decision algorithm decides if an *APi* will be switched off based on the double threshold criteria if the cSCIFI algorithm is used. Equation (2) shows the double threshold criteria, where the first criterion defines if the traffic demand is too low for the *APi* to be switched on and the second criterion defines if the cluster switched-on *APs* can handle the *APi* traffic. If there are *d* more switched-on *APs* in the cluster, the cluster maximum traffic capacity *CCA* increases to (*d* + 1)*Tmax* because cSCIFI guarantees that all *APs* inside a cluster can provide connection to any mobile station trying to connect to any *AP* in the cluster.

$$\begin{cases} DM_i < T_{\min} \\ CCA - \left(\sum_{a=1}^{d} DM_a + \sum_{b=1}^{o} DM_b\right) \geq DM_i, \end{cases} \quad \text{where } CCA = (d+1)T_{\max} \tag{2}$$

On the other hand, in cSCIFI+, all *APs* in a cluster are neighbors only of the cluster head. Therefore, *APs* with estimated traffic demand below a minimum threshold *Tmin* for the whole time window are switched off as long as the available traffic capacity provided by the cluster head can handle their estimated traffic. Equation (3) shows the double threshold criteria if cSCIFI+ is used, where the first criterion defines if the traffic demand is too low for *APi* to be switched on and the second criterion defines if the cluster head can handle the *APi* traffic. In cSCIFI+, the cluster maximum traffic capacity *CCA* is fixed to *Tmax* because cSCIFI+ guarantees that only the cluster head can provide connection to any mobile station trying to connect to any *AP* in the cluster, except the *AP* itself.

$$\begin{cases} DM_i < T_{\min} \\ CCA - \left(\sum_{a=1}^{d} DM_a + \sum_{b=1}^{o} DM_b\right) \geq DM_i, \end{cases} \quad \text{where } CCA = T_{\max} \tag{3}$$
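As a rough illustration, the double threshold criteria of Equations (2) and (3) can be evaluated with a single helper. The function name `can_switch_off`, its argument layout and the demand values are hypothetical. The comparison operator in the second criterion is garbled in the extracted equations, so the sketch assumes "greater than or equal to".

```python
def can_switch_off(dm_i, on_demands, off_demands, t_min, t_max, clustering):
    """Evaluate the double threshold criteria for one candidate AP.

    dm_i        -- estimated demand DM_i of the candidate AP for the window
    on_demands  -- estimated demands of the d switched-on APs
    off_demands -- estimated demands of the o already switched-off APs
    clustering  -- "cSCIFI" (Equation (2)) or "cSCIFI+" (Equation (3))
    """
    if clustering == "cSCIFI":
        cca = (len(on_demands) + 1) * t_max  # CCA = (d + 1) * Tmax
    elif clustering == "cSCIFI+":
        cca = t_max                          # only the cluster head counts
    else:
        raise ValueError(f"unknown clustering algorithm: {clustering}")

    low_demand = dm_i < t_min                                # first criterion
    spare_capacity = cca - (sum(on_demands) + sum(off_demands))
    return low_demand and spare_capacity >= dm_i             # second criterion (assumed >=)


# Hypothetical window: Tmin = 54 and Tmax = 300 associations.
on, off = [200, 90], [30]
cscifi_ok = can_switch_off(40, on, off, t_min=54, t_max=300, clustering="cSCIFI")
plus_ok = can_switch_off(40, on, off, t_min=54, t_max=300, clustering="cSCIFI+")
busy_ok = can_switch_off(60, on, off, t_min=54, t_max=300, clustering="cSCIFI")
```

With these numbers the candidate can be switched off under cSCIFI (capacity 900) but not under cSCIFI+ (capacity 300), and an *AP* with a demand of 60 stays on in both cases because it fails the first criterion.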

eSCIFI has several parameters that must be configured and that may depend on the network usage profile, such as the selection of the time window size and *Tmin* value. In the next section, we evaluate how the different components in the eSCIFI architecture affect the mechanism energy saving capacity and the network coverage to its users. We also compare eSCIFI to other related work about energy saving mechanisms that are applicable in our evaluation scenario.

#### **4. eSCIFI Evaluation**

To evaluate how eSCIFI impacts the network performance, we performed trace-driven simulations using the real association trace data collected from the UFF SCIFI network. Our trace-driven tests use UFF SCIFI's association traces to reproduce a real network scenario. The idea is to compare how the network would respond to the changes in that scenario using distinct energy saving mechanisms. A trace-driven test does not require a network simulator such as NS-3. It allows estimating the metrics by simply inputting the real association traces to the eSCIFI mechanism and then evaluating whether eSCIFI can cope with user demand while saving energy. To perform our simulations, we use the association data collected for one week in September 2018 from the H building at UFF. The week used is formed by a weekend (1 and 2 September 2018) and 5 weekdays from Monday to Friday (24–28 September 2018). The weekend is not adjacent to the weekdays because there were no complete association history traces for the weekend before or after those weekdays. This might have happened for several reasons, such as energy outages or network failures. However, the weekend (1 and 2 September 2018) contains the association data for all *APs* in the H building, and therefore will be used to represent Saturday and Sunday in our trace-driven simulations. We also use Brazil's Independence Day public holiday (September 7) to compare and evaluate how eSCIFI impacts the network on holidays.

The work of [40] presents a mathematical formula, indicated in Equation (4), that allows us to determine the energy saving factor *ESF* achieved with the *AP* wireless network interface shut down during periods of time. The formula gives the saved energy percentage when shutting down the *AP* wireless network interface compared to the total energy that would be consumed if its interface works the whole time.

Terms *Pext*\_*on* and *Pext*\_*off* of Equation (4) represent the measured power values in Watts, in the *AP* external power source, when the wireless network interface is powered on and off, respectively. Terms *ton* and *ttotal* represent the amount of time the *AP* stayed with its wireless interface switched on and the total analysis elapsed time, respectively. Equation (4) provides the percentage of energy that could have been saved by switching off the wireless interfaces of the *AP* during the idle time slots. This formula can be easily extended to also provide the network's energy saving factor. To do so, the terms *ton* and *ttotal* must change in order to represent the sum of the time that all *APs* in the network stayed switched on and the total time multiplied by the number of *APs* in the network, respectively.

$$ESF = \frac{P_{\text{ext\_on}} - P_{\text{ext\_off}}}{P_{\text{ext\_on}}} \left(1 - \frac{t_{on}}{t_{total}}\right) \tag{4}$$

From Equation (4), it is possible to notice that *ESF* reaches its maximum value, *ESFmax*, when *ton* = 0. This condition represents the scenario where the wireless interfaces of all *APs* in the network are switched off during the whole time. However, it is also possible to notice that, depending on the scenario and switching off scheme, *ESFmax* can assume several values. Therefore, the normalized energy saving factor $\overline{ESF}$, given by Equation (5), can better indicate the performance of the mechanism in different scenarios. The normalized energy saving factor $\overline{ESF}$ is limited between 0% and 100% and represents the percentage of the maximum energy saving factor that was actually achieved.

$$\overline{ESF}(\%) = \frac{ESF(\%)}{ESF_{\max}(\%)} \tag{5}$$
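Equations (4) and (5), together with the network-level extension described in the text, can be sketched as below. The function names and the power values (5.0 W and 4.0 W) are hypothetical and chosen only to keep the arithmetic easy to follow; they are not the measured values of Table 3.

```python
def energy_saving_factor(p_ext_on, p_ext_off, t_on, t_total):
    """Equation (4): fraction of energy saved, between 0.0 and 1.0."""
    return (p_ext_on - p_ext_off) / p_ext_on * (1 - t_on / t_total)


def network_energy_saving_factor(p_ext_on, p_ext_off, on_times, t_total):
    """Network-level extension: t_on becomes the sum of all APs' on-times
    and t_total is multiplied by the number of APs."""
    return energy_saving_factor(
        p_ext_on, p_ext_off, sum(on_times), t_total * len(on_times)
    )


def normalized_esf(esf, esf_max):
    """Equation (5): achieved share of the maximum energy saving factor."""
    return esf / esf_max


# Hypothetical power readings in Watts (NOT the values of Table 3).
P_ON, P_OFF = 5.0, 4.0

# ESFmax is reached when the interface is never switched on (t_on = 0).
esf_max = energy_saving_factor(P_ON, P_OFF, t_on=0, t_total=24)

# Three APs over 24 h: always on, on for 12 h, and always off.
esf = network_energy_saving_factor(P_ON, P_OFF, on_times=[24, 12, 0], t_total=24)
normalized = normalized_esf(esf, esf_max)
```

In this toy case the interfaces are on for half of the total AP-hours, so the network reaches half of the maximum achievable saving.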

The work of [12] defines the coverage ratio loss *CR* formula, indicated in Equation (6). The coverage ratio loss is the number of uncovered clients *Ul* by the energy saving mechanism over the total clients in the network *U* within a certain period of time. The coverage ratio loss gives the percentage of clients that could not successfully access the network under the evaluated period.

$$
CR(\%) = \frac{U_l}{U} \times 100 \tag{6}
$$

The analysis in this section will evaluate the normalized energy saving factor (Equation (5)) and the coverage ratio loss (Equation (6)) to compare how the eSCIFI mechanism impacts the network performance. To calculate the coverage ratio loss, we must know the parameter *Tmax*, which indicates the maximum number of associations an *AP* might support in a time slot. We defined *Tmax* = 300, which is roughly the maximum number of associations registered in a time slot plus 10%. In our experimental scenario, only the wireless network interface will be switched off. Table 3 shows the consumed power measured for the *AP* model present in the UFF SCIFI network when the wireless interface is switched on and off (*Pext*\_*on* and *Pext*\_*off*). Table 3 also shows the maximum power saving factor, *ESFmax*, which represents the power saving factor percentage if the wireless interfaces of all *APs* were switched off the whole time. Therefore, in our evaluation scenario, the maximum energy saving factor percentage that could be reached by switching off the wireless interface of the entire network during the whole evaluation period is 23.93%. That information is required by the normalized energy saving factor calculations.
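A minimal sketch of the coverage ratio loss of Equation (6) follows. How uncovered clients are counted is not spelled out in the text, so the helper `uncovered_in_cluster` is an assumption: it treats every association beyond the combined capacity of the switched-on *APs* of a cluster as uncovered. Function names and numbers are hypothetical, except *Tmax* = 300.

```python
T_MAX = 300  # maximum associations per AP, as defined in the text


def uncovered_in_cluster(real_demand, switched_on_aps, t_max=T_MAX):
    """Assumed counting rule: demand above the switched-on capacity is uncovered."""
    capacity = switched_on_aps * t_max
    return max(0, real_demand - capacity)


def coverage_ratio_loss(uncovered_clients, total_clients):
    """Equation (6): percentage of clients that could not access the network."""
    if total_clients == 0:
        return 0.0
    return 100 * uncovered_clients / total_clients


# Hypothetical period with two clusters.
uncovered = (
    uncovered_in_cluster(real_demand=660, switched_on_aps=2)    # 60 above capacity
    + uncovered_in_cluster(real_demand=540, switched_on_aps=2)  # fully covered
)
cr = coverage_ratio_loss(uncovered, total_clients=660 + 540)
```

Here 60 of the 1200 clients are left without coverage, which gives a coverage ratio loss of 5%.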

**Table 3.** *AP*'s consumed power and maximum power saving factor percentage.


We are going to evaluate how several components of the eSCIFI architecture impact the network performance. eSCIFI using the cSCIFI and the cSCIFI+ clustering algorithms will also be compared with other mechanisms proposed in the literature: the SEAR, ACE and ECMA mechanisms proposed by Jardosh et al. [2], Fang et al. [12] and Silva et al. [14], respectively. The SEAR mechanism uses the green clustering algorithm and a single threshold, where only the *Tmin* parameter is used as the RoD strategy. In the SEAR mechanism, the network *APs* are grouped into clusters, the cluster head is always switched on and the other *APs* in the clusters remain switched off as long as their traffic demand is lower than *Tmin*. The ACE mechanism uses an inactivity time window based on machine learning occupancy detection results as its RoD strategy and does not have a coverage guarantee. *APs* that remain unused for a whole time window are switched off for the whole time window duration. The ECMA mechanism uses the SEAR mechanism for night hours (between 0 a.m.–6:59 a.m.) and keeps the whole network switched on the rest of the day. A Baseline mechanism, in which all the *APs* in the network remain switched on between 7 a.m.–11:59 p.m. and switched off between 0 a.m.–6:59 a.m., is also used for comparison.

We will evaluate how the time window size and the minimum threshold value affect the network performance. After that, we will also compare the eSCIFI mechanism performance on a regular weekday with its performance on a public holiday. Our last analysis will compare the performance of the SEAR green clustering with that of eSCIFI using the cSCIFI and cSCIFI+ clustering algorithms, using different neighborhood lists. Our trace-driven simulation scripts were developed in Python.

#### *4.1. Time Window Size Analysis*

In Section 3.4.2, we have seen that the time window *tw* defines how long it takes before reconfiguring the network *APs* energy state. A bigger time window is desired since it minimizes the number of times the controller needs to change the *APs* working status, which minimizes the controller tasks over a day. On the other hand, a bigger time window may not notice small traffic demand bursts, which may lead to network coverage losses during these bursts. Therefore, we need to evaluate how the eSCIFI time window size may affect the network energy saving and coverage loss. We tested the eSCIFI mechanisms using 5 different time window values (10 min, 30 min, 1 h, 1:30 h and 2 h). Those time window values are multiples of our time slot size and correspond to 1, 3, 6, 9 and 12 time slots, respectively. Their upper bound was selected based on the lecture duration time at UFF, which is usually 2 h. The real and predicted association values for time windows bigger than one time slot (10 min) are the sum of the devices connected during the corresponding number of time slots. We evaluated eSCIFI using the cSCIFI and cSCIFI+ clustering algorithms and *Tmin* = 72 with different time windows to evaluate the normalized energy saving factor and coverage loss. Those fixed parameters were used because they delivered the best normalized energy saving factor percentage and coverage ratio loss for all possible time windows. We will also evaluate the time window size effect in the SEAR, ACE, ECMA and Baseline mechanisms.
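The aggregation of 10 min time slots into larger time windows can be sketched as follows. The function name `aggregate_time_windows` and the sample trace are hypothetical; the slot counts per window (1, 3, 6, 9 and 12) come from the text.

```python
# Time window sizes tested in the text, in minutes -> number of 10 min slots.
WINDOW_SLOTS = {10: 1, 30: 3, 60: 6, 90: 9, 120: 12}


def aggregate_time_windows(slot_counts, slots_per_window):
    """Sum per-slot association counts into consecutive time windows."""
    if len(slot_counts) % slots_per_window != 0:
        raise ValueError("slot count must be a multiple of the window size")
    return [
        sum(slot_counts[start:start + slots_per_window])
        for start in range(0, len(slot_counts), slots_per_window)
    ]


# Hypothetical day: 144 slots of 10 min, one association in every slot.
day = [1] * 144
two_hour_windows = aggregate_time_windows(day, WINDOW_SLOTS[120])
```

A day of 144 slots yields 12 two-hour windows, each holding the sum of its 12 slots.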

**Figure 3.** Normalized energy saving factor for different time window sizes.

As we can see in Figure 3, the selected time window sizes have not affected the normalized energy saving factor $\overline{ESF}$ for SEAR and eSCIFI using the cSCIFI and the cSCIFI+ clustering algorithms. Only ECMA and ACE had their normalized energy saving factor negatively affected by the time window size. The Baseline mechanism does not depend on the time window (its scheduling presents fixed switching on/off periods), and therefore we can see that its normalized energy saving factor does not change. This result means that for our evaluation scenario it is possible to use a 2-h time window resolution without affecting the normalized energy saving factor for our eSCIFI mechanism. This would allow the eSCIFI mechanism to compute fewer energy state changes in the *APs* and consequently leave fewer tasks to be executed by the wireless network controller.

Figure 4 shows how the different time window sizes affect the coverage ratio. As we can see, only Baseline and ACE presented coverage ratio losses in this evaluation scenario. The Baseline mechanism has a fixed coverage loss that does not depend on the time window size. The Baseline loss occurs due to unattended users in the night hours, when all *APs* are switched off. However, the ACE mechanism shows a small decrease in the coverage loss as the time window grows. That result was expected because ACE uses the time window size as an inactivity criterion to switch off *APs*, and therefore a large time window would require a longer period of inactivity, which would be harder to achieve and consequently would lower the chances of mistakenly switching off *APs*.

**Figure 4.** Coverage loss for different time window sizes.

#### *4.2. Minimum Threshold Analysis*

The last parameter in our eSCIFI mechanism that needs to be evaluated is the *Tmin* value selection. *Tmin* defines the minimum number of associations that an *AP* must have during the time window duration to be switched on. If the number of associations is below *Tmin*, the *AP* will be evaluated to be switched off by the energy state decision algorithm. In this section, we evaluate how the *Tmin* value affects the normalized energy saving factor and the coverage ratio. To do so, we varied the value assumed by *Tmin* during one time slot over all multiples of 9 ranging from 9 to 90. The *Tmin* value is proportional to the time window size used: if the time window has a size of *w* time slots, the threshold applied will be *w* × *Tmin*. We fixed the time window size to 12 time slots (2 h or 120 min).
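The scaling of the per-slot *Tmin* values by the time window size can be written out as a short worked example. The variable and function names are hypothetical; the per-slot values (multiples of 9 from 9 to 90) and the window of 12 time slots come from the text.

```python
# Per-slot Tmin values evaluated in the text: 9, 18, ..., 90.
PER_SLOT_T_MIN = list(range(9, 91, 9))

WINDOW_SLOTS = 12  # 2 h time window made of 10 min slots


def scaled_t_min(per_slot_t_min, window_slots):
    """Threshold applied to a whole window: w x Tmin."""
    return window_slots * per_slot_t_min


window_thresholds = [scaled_t_min(t, WINDOW_SLOTS) for t in PER_SLOT_T_MIN]
```

For a 12-slot window, the per-slot value *Tmin* = 54 corresponds to a window-level threshold of 648 associations.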

Figure 5 shows the normalized energy saving factor achieved by eSCIFI using the cSCIFI and cSCIFI+ clustering algorithms, SEAR, ACE and ECMA. As can be seen in Figure 5, SEAR and eSCIFI using cSCIFI+ got the best energy saving percentages in our evaluation scenario. eSCIFI using cSCIFI had a smaller energy saving percentage because it has a different cluster set that is bigger than the ones formed by SEAR and by eSCIFI using cSCIFI+. From Figure 5, we can also see that the normalized energy saving factor $\overline{ESF}$ grows as *Tmin* grows until it reaches *Tmin* = 54; after that, the energy saving factor stays the same for all mechanisms. This result was expected and it is the same result achieved by Dalmasso et al. [8]. This asymptotic characteristic in the normalized energy saving factor curve happens because, for values of *Tmin* higher than 54, the cluster maximum capacity *CCA* threshold is reached, requiring those same *APs* to be turned on anyway.

**Figure 5.** Normalized Energy saving factor for different *Tmin* values.

Higher *Tmin* values mean that *APs* will require a higher number of associations in a time window to be switched on according to the first criterion, which means it will be harder for them to be switched on. However, those *APs* will have their demand transferred to the cluster head (or another switched-on *AP* in the case where cSCIFI has been used). This means that the cluster maximum capacity *CCA* threshold will be reached sooner and the *APs* will have to be turned on anyway. Therefore, the normalized energy saving factor is limited and there is a *Tmin* value that reaches it. Increasing *Tmin* after its optimum value will not change the normalized energy saving factor. The possible explanations may be that, after *Tmin* = 54, SEAR already reaches the minimum number of *APs* required to guarantee coverage (only the cluster heads may be switched on), or that the *APs* still switched on after that value present traffic demands much higher than the maximum value of *Tmin* = 90. Figure 5 shows that ECMA has a steady normalized energy saving factor that does not depend on the *Tmin* value. This might happen because ECMA applies the SEAR mechanism in night hours (between 0 a.m.–6:59 a.m.) and keeps the whole network switched on the rest of the day. The network has very little traffic demand in night hours, and therefore few *APs* are required to be turned on or will have enough traffic to trigger the cluster maximum capacity threshold. That way, ECMA already reaches its highest energy saving factor with *Tmin* = 9. ACE and Baseline do not have a *Tmin* parameter for energy state decision and therefore their normalized energy saving factors do not change.

We also evaluate how different *Tmin* values affect the coverage ratio. As we can see in Figure 6, eSCIFI using both clustering algorithms (cSCIFI and cSCIFI+), SEAR and ECMA had no coverage ratio loss at all for any value of *Tmin*. These results show that none of the mechanisms exceeded the maximum cluster capacity at any moment. Only Baseline and ACE present coverage losses. However, as we have already mentioned, those mechanisms do not change their energy state decisions based on a minimum threshold *Tmin* parameter. Therefore, their coverage ratio results are the same as those shown in Figure 4, where the time window size is *tw* = 120 min.

**Figure 6.** Coverage ratio loss for different *Tmin* values.

#### *4.3. Weekday Versus Holiday Analysis*

eSCIFI uses machine learning prediction models to estimate traffic demands. In our scenario, the hybrid model uses a holiday input feature that distinguishes normal weekdays from public holidays and university student holidays. The hybrid model uses this feature to differentiate the network demand variation that happens between regular days and holidays. Here, we will evaluate whether eSCIFI using the hybrid model can cope with the holiday demand better than SEAR, ACE and ECMA. We compare Brazil's Independence Day public holiday (Friday, September 7) and the Friday used in our regular week. We compared the mechanisms using the parameters that gave the best normalized energy saving factor and smallest coverage ratio loss (*Tmin* = 54 and *tw* = 120 min).

Figure 7 shows the normalized energy saving factor achieved by the different mechanisms. As we can see, eSCIFI using both clustering algorithms and SEAR kept the normalized energy saving factor stable. eSCIFI using cSCIFI+ has the biggest normalized energy saving factor for both the holiday and the weekday. Baseline and ECMA also have their normalized energy saving factor unchanged. For Baseline, this happens because the decision is based only on time schedules and not on traffic demand estimations, and therefore is unaffected. For ECMA, this result happened because the night-hour traffic demand is the same for the holiday and the weekday, which did not change the SEAR *APs* switching on/off schedule during night hours. Only ACE had a reduction in the normalized energy saving factor in our holiday evaluation.

**Figure 7.** Normalized Energy saving factor comparison between a holiday and a weekday.

As we have seen in Figure 2, the demand on that holiday (Friday, September 7) was much smaller than the demand presented for the regular weekday (Friday, September 28). Figure 2 also shows that the holiday demand predicted by the hybrid model is much bigger than the real one, unlike the regular Friday, where the hybrid model prediction was very close to the real traffic. Those results would first suggest that the normalized energy saving factor achieved by eSCIFI and SEAR for the public holiday should have been smaller, as happened with ACE. However, Figure 2 shows that the erroneous estimations of the hybrid model have not even reached 200 associated devices for the whole network at any moment of the day on September 7. Figure 2 also shows that the regular Friday has not even reached 500 associated devices for the whole network on September 28. Those association values are very low considering that *Tmax* = 300. Therefore, we can presume that, for the evaluated regular and holiday Friday, the network is working with the minimum set of *APs* switched on (only the cluster heads) and that is the reason why SEAR and eSCIFI using both clustering algorithms have their normalized energy saving factor unchanged. In fact, we analyzed how using the real data would affect the normalized energy saving factor results for the mechanisms, and the results would not have changed much (less than 3.3% for all mechanisms).

Figure 8 shows the mechanisms' coverage ratio loss for the regular weekday and for the holiday. Only ACE and Baseline present some coverage loss since they are the only mechanisms that do not have a coverage guarantee. The Baseline coverage loss remains the same, which shows that the traffic demand for night hours (0 a.m.–6:59 a.m.) on both our holiday and our weekday remains unchanged. The smaller coverage ratio loss of the ACE mechanism on our holiday, when compared to our weekday, can be explained by the smaller traffic demand estimated for the whole day. Another explanation for the reduced coverage ratio loss of the ACE mechanism may be that it has a smaller normalized energy saving factor on our holiday, which means that fewer *APs* are switched off or that they are switched off for a shorter period of time.

**Figure 8.** Coverage ratio loss comparison between a holiday and a weekday.

The results shown in this section cannot give us a precise conclusion on whether or not our algorithm could cope with the holiday demand without sacrificing the normalized energy saving factor. A large number of holidays in distinct weekdays and with distinct demand estimations would be necessary to understand it better. However, results indicated that the mechanism performance on holidays is not related to any change on its function, but it is in fact intimately related to the correct traffic estimations given by the hybrid model when compared to the real traffic data.

#### *4.4. SEAR vs. eSCIFI Clustering Algorithms*

As we have seen in Section 4.1, SEAR had a better normalized energy saving factor result than eSCIFI using the cSCIFI clustering algorithm. However, the clustering algorithm developed by Jardosh et al. [2] does not have the same optimization criteria we have implemented in both of our algorithms. As a result, the green clustering algorithm of Jardosh et al. [2] is sensitive to the order in which *APs* appear in the neighborhood lists of other *APs*: it has no tie-breaker rule, so when several *APs* have the same number of neighbors, this order determines which *AP* is selected next to fill the cluster. cSCIFI and cSCIFI+ do not have this disadvantage, and therefore we can guarantee that the clusters formed will not depend on the order of appearance. Here, we compare the normalized energy saving factor achieved by SEAR and by the eSCIFI mechanism using both clustering algorithms for three different orders of appearance of *APs* in the neighborhood lists. In the first two neighborhood lists, the positions of the *APs* are randomized; the third is the neighborhood list we used in all previous tests of the SEAR mechanism. We compare SEAR and the eSCIFI mechanism using both clustering algorithms with the parameters that gave the best normalized energy saving factor and no coverage ratio loss (*Tmin* = 54 and *tw* = 120).

As we can see in Figure 9, the SEAR energy saving result is heavily affected by the order of appearance of *APs* in the neighborhood list, while eSCIFI using cSCIFI and cSCIFI+ is not affected at all. This result shows that the changes we have implemented in cSCIFI and cSCIFI+ have made our algorithms insensitive to the order of appearance of *APs* in the neighborhood list. This is a clear advantage, since it removes the need to optimize the neighborhood list formation process, which might be impractical in a very large network topology.

**Figure 9.** Normalized energy saving factor comparison between SEAR and eSCIFI using both clustering algorithms with different neighborhood lists.
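The order dependence discussed in this section can be illustrated with a small sketch. It is not the SEAR, cSCIFI, or cSCIFI+ algorithm; it only shows why selecting the *AP* with the most neighbors without a tie-breaker makes the outcome depend on list order, and how any deterministic tie-breaker removes that dependence. The AP names, the neighbor map, and the tie-breaking rule (smallest AP name) are all hypothetical.

```python
# Illustrative only: this is not the SEAR or cSCIFI algorithm.
# Hypothetical neighbor map (a chain of four APs): AP name -> neighboring APs.
NEIGHBORS = {
    "ap1": {"ap2"},
    "ap2": {"ap1", "ap3"},
    "ap3": {"ap2", "ap4"},
    "ap4": {"ap3"},
}


def pick_first_max(ap_order):
    """Pick the AP with the most neighbors; a tie goes to whichever AP is listed first."""
    best = None
    for ap in ap_order:
        if best is None or len(NEIGHBORS[ap]) > len(NEIGHBORS[best]):
            best = ap
    return best


def pick_with_tie_breaker(ap_order):
    """Same criterion, but ties are broken by AP name (an assumed rule)."""
    return min(ap_order, key=lambda ap: (-len(NEIGHBORS[ap]), ap))


order_a = ["ap1", "ap2", "ap3", "ap4"]
order_b = ["ap4", "ap3", "ap2", "ap1"]

# Without a tie-breaker the choice follows the list order.
print(pick_first_max(order_a), pick_first_max(order_b))                # ap2 ap3
# With a tie-breaker the choice is the same for both orders.
print(pick_with_tie_breaker(order_a), pick_with_tie_breaker(order_b))  # ap2 ap2
```

Here `ap2` and `ap3` tie with two neighbors each, so the first function returns a different *AP* for each ordering, while the second always returns the same one.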

#### *4.5. Best Results Comparison*

Table 4 shows the best energy saving factor and coverage ratio loss results that can be achieved by the distinct mechanisms for our trace-driven test. Each mechanism uses distinct algorithms with a set of parameters as we have seen in the previous sections. As seen in Table 4, eSCIFI using cSCIFI+ achieved the highest energy saving factor among all mechanisms (64.32%) while presenting 0% coverage ratio loss.

ECMA, SEAR and eSCIFI using cSCIFI also achieved 0% coverage ratio loss. However, those mechanisms achieved lower energy saving factors when compared to the eSCIFI mechanism using cSCIFI+. SEAR got similar results to cSCIFI+ and better results than eSCIFI using cSCIFI. However, as explained previously and shown in Figure 9, SEAR energy saving result is affected by the neighborhood list ordering, while eSCIFI proposals using cSCIFI or cSCIFI+ are not affected at all. Therefore, our results show that eSCIFI using the cSCIFI+ algorithm achieved the best energy saving and coverage ratio loss results for our scenario.

**Table 4.** Mechanism's best results comparison.


#### **5. Conclusions**

We presented the eSCIFI energy saving mechanism and its main architecture. eSCIFI uses traffic demand estimations given by machine learning models to manage the energy state of *APs*, and it was designed to cope with a broader variety of wireless networks, especially those that cannot collect traffic data in real time and/or have limited CPU power.

We evaluated the normalized energy saving factor and the coverage ratio loss of our proposed mechanism. We also reproduced ACE, ECMA and SEAR and compared their results with those of eSCIFI. Those results showed that, for the UFF SCIFI network scenario, eSCIFI produced the best results. The best energy saving mechanism was eSCIFI using the cSCIFI+ clustering algorithm, which can save up to 64.32% of the total energy consumed in a week without affecting the network coverage and user's association capacity.

eSCIFI has not yet been implemented and tested in real network scenarios. As future work, we plan to implement eSCIFI on the UFF SCIFI controller and run experiments using the real network infrastructure. Practical usage will give us real insights into how to properly tune the eSCIFI parameters in a real implementation. We also plan to use more sophisticated metrics, such as the average throughput and delay, to evaluate the network performance and user coverage in those real network tests. We hope that those tests and new features will allow us to fully understand the eSCIFI possibilities and overcome its limitations.

**Author Contributions:** Conceptualization, G.H.A. and D.C.M.-S.; methodology, G.H.A., F.B. and D.C.M.-S.; validation, G.H.A., F.B., D.C.M.-S. and L.C.S.M.; formal analysis, G.H.A.; investigation, G.H.A.; resources, G.H.A. and L.C.S.M.; data curation, G.H.A., F.B. and L.C.S.M.; writing—original draft preparation, G.H.A. and D.C.M.-S.; writing—review and editing, G.H.A., F.B., D.C.M.-S. and L.C.S.M.; supervision, F.B., D.C.M.-S. and L.C.S.M.; project administration, D.C.M.-S.; funding acquisition, D.C.M.-S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was partially funded by the Research Foundation of the State of Rio de Janeiro (FAPERJ), the Research Foundation of the State of São Paulo (FAPESP), the Coordination for the Improvement of Higher Education Personnel (CAPES), CAPES PRINT, and the Brazilian National Council for Scientific and Technological Development (CNPq).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The UFF SCIFI network association datasets (for classification and regression) are available at https://github.com/midiacom/UFF-SCIFI-Datasets, accessed on 3 December 2021.

**Acknowledgments:** We thank FAPERJ, FAPESP, CAPES, CAPES PRINT, and CNPq for their financial support.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **The Impact of Climate Change on a University Campus' Energy Use: Use of Machine Learning and Building Characteristics**

**Haekyung Im 1,\*, Ravi S. Srinivasan 1, Daniel Maxwell 2, Ruth L. Steiner <sup>3</sup> and Sayar Karmakar <sup>4</sup>**


**Abstract:** Global warming is expected to reach 1.5 °C between 2030 and 2052. This may lead to an increase in building energy consumption. With the changing climate, university campuses need to prepare to mitigate risks with building energy forecasting models. Although many scholars have developed building energy models (BEMs), only a few have focused on the interpretation of the meaning of BEM, including climate change and its impacts. Additionally, despite several review papers on BEMs, there is no comprehensive guideline indicating which variables are appropriate to use to explain building energy consumption. This study developed building energy prediction models by using statistical analysis: multivariate regression models, multiple linear regression (MLR) models, and relative importance analysis. The outputs are electricity (ELC) and steam (STM) consumption. The independent variables used as inputs are building characteristics, temporal variables, and meteorological variables. Results showed that categorizing the campus buildings by building type is critical, and the equipment power density is the most important factor for ELC consumption, while the heating degree is the most critical factor for STM consumption. The laboratory building type is the most STM-consuming building type, so it needs to be monitored closely. The prediction models give an insight into which building factors remain essential and applicable to campus building policy and campus action plans. Increasing STM is to raise awareness of the severity of climate change through future weather scenarios.

**Keywords:** building energy modelling; regression analysis; machine learning; climate change; university campus; energy consumption prediction

#### **1. Introduction**

Even though accelerating issues on climate change are arousing people's awareness, many are still uninformed about how climate change affects building energy consumption. Building operations are responsible for 28% of total emissions, while embodied carbon, which is from building materials and construction, is responsible for an additional 11% annually [1]. This is nearly 40% of CO2 emissions coming from the building and building construction sector, which is responsible for over one-third of global energy consumption [2]. Global warming—a product of climate change—is also expected to significantly increase building energy use for cooling. Climate change phenomena are not uniform across the globe, so energy consumption direction or protocol can differ by climate zone. Therefore, this study focuses on the energy use of a Philadelphia (PA)-based university's campus under climate zone 4A, comprised of a humid subtropical climate [3].

College and university campuses use an average of 18.9 kilowatt-hours (kWh) of electricity (ELC) and 17 cubic feet (ft³) of natural gas per square foot annually [4]. Universities consume a large amount of energy and have updated their sustainability plans to reduce greenhouse gas emissions and energy consumption. Universities serve as pioneers for green societies and make immediate decisions and policies to address climate change, compared to other national-level organizations. Since university campuses include various building types, such as libraries, offices, laboratories, hospitals, and housing, these installations remain uniquely positioned to analyze the energy consumption of "mixed-use" group buildings. Essentially, a college or university campus serves as an independent district. Thus, typical university campuses can offer informative representations of urban energy consumption.

**Citation:** Im, H.; Srinivasan, R.S.; Maxwell, D.; Steiner, R.L.; Karmakar, S. The Impact of Climate Change on a University Campus' Energy Use: Use of Machine Learning and Building Characteristics. *Buildings* **2022**, *12*, 108. https://doi.org/10.3390/buildings12020108

Academic Editors: Shi-Jie Cao and Wei Feng

Received: 13 December 2021 Accepted: 19 January 2022 Published: 23 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

To minimize energy consumption, building energy consumption patterns need to be monitored and analyzed. By doing so, universities identify ways to effectively manage energy use and plan building renovation with an understanding about the relationship between building characteristics and energy consumption. In addition, energy consumption prediction can be beneficial for universities to forecast reliable energy budgets and identify opportunities of energy conservation [5].

This study investigates global climate change's influence on building energy consumption, the building's behavior, and other energy-related factors on campus. In this case study, ELC and steam (STM) were considered. ELC was provided by Penn Power, and the STM distribution system consisted of underground pipelines. Building characteristics, meteorological variables, and temporal variables were used to develop BEMs with a bottom-up approach for ELC and STM consumption. Among various bottom-up approaches, a multivariate regression and multiple linear regression (MLR) were chosen.

Thanks to rapid technology development, electronic appliance efficiency is improving, and construction materials are evolving for better insulation. In addition, smart homes and intelligent cities are becoming more popular, and the number of electronic devices people own is rising fast. To eliminate complex interventions, the impact of climate change on the energy consumption of existing buildings is analyzed purely by excluding technological advances. Furthermore, occupancy behavior was not considered in the focus on weather features and building characteristics. Consequently, operation features such as set point temperature, fan schedules, HVAC schedule (heating and cooling), lighting schedule, class schedule, building-operation hours, and building use schedule (occupancy schedule) were excluded.

#### **2. Literature Review**

To determine the method for this research, the statistical analyses used in BEMs were examined. Additionally, research on the BEMs of university campuses was analyzed to narrow the target. Then, the status of climate change and how global warming will affect building energy consumption were investigated. Finally, based on previous literature reviews, variable selections from other research allowed us to decide how to structure the data frame. Additionally, via literature reviews, this study intends to investigate factors that affect energy consumption through the interpretation of statistical models, with the following research questions (RQ):

RQ1. What are the appropriate building- and weather-related variables to be included in the predictive model?

RQ2. What is the relationship between input variables and energy consumption (output) by energy types considered by relative importance?

RQ3. How will each energy type change with the effects of global warming?

#### *2.1. BEMs Focused on Statistical Analysis*

Several review articles have summarized and defined Urban Building Energy Modeling (UBEM) [6,7]. The prevailing UBEM has two opposite modeling approaches: top-down or bottom-up. The top-down approach works at an aggregated level, typically aimed at fitting historical timelines based on national energy consumption while the bottom-up approach is built up from data as disaggregated components.

Since this study is focused on the bottom-up approach, this paper specifically discusses regression modeling approaches. Baker and Rylatt used clustering, simple regression, and MLR [8]. Kavousian et al. used stepwise selection to choose predictors in an MLR model because all input variables are numeric, not categorical [9]. Hsu used a Bayesian multilevel regression model to analyze the value of different measurements for predicting energy use and found that benchmarking data alone explains energy use as well as benchmarking and auditing data together [10]. Zeng et al. considered the multivariate regression model to be the best method in the case study with simplified inputs related to building energy [11]. Walter and Sohn also developed a multivariate regression model to predict Energy Use Intensity (EUI) by using numerical predictors and categorical indicator variables [12]. For example, numerical predictors were operating hours, occupant density, etc., and categorical indicator variables were climate zone, heating system type, etc. This model measured the contribution of building characteristics and systems to energy use based on the cross-validation (CV) approach.

Most previous research in BEMs has failed to interpret the BEM, used insufficient samples, or missed some important variables, such as equipment power density (EPD) or meteorological variables. Most research has focused on maximizing the performance accuracy of machine learning (ML) models and finding the best model by comparing different ML techniques. This research has shown the improvement of the ML models' accuracy. According to Kikumoto et al., as a result of increasing temperature, energy simulation showed a 15% increase in the heat load of two residential buildings [13]. A small sample size for application made it difficult to generalize future energy consumption. However, protocols or interpretations of the final ML models were frequently missing, so this study aims to interpret the regression models with a larger sample size.

#### *2.2. BEMs Focused on University Campuses*

Owing to the trend of growing energy consumption, Hong et al. investigated the energy waste in universities and suggested optimization options of campus buildings in South Korea [14]. The amount of ELC had risen about 19.7% over three years because the equipment used for heat had increased. Hong et al. also found that higher monthly average temperatures led to more A/C use on campus [15]. They developed several energy saving scenarios to suggest possible solutions. Chung and Rhee observed that equipment loads and occupancy schedules of university buildings for education and research are difficult to control [16]. However, they found that universities have a high potential to reduce energy losses caused by unnecessary energy consumption, low thermal performance, and airtightness. Because of the many random users rather than the operators themselves in the university buildings, retrofitting the existing buildings into low-energy buildings is crucial [14].

Guan et al. analyzed the ELC, heating, and water usage of campus buildings in Norway, which has a climate similar to climate zone category 5A [17]. They said that UBEM plays a critical role in learning about the efficient energy planning of future urban energy systems and smart systems. Several Chinese universities and colleges have initiated sustainable education and developed incentive policies to encourage students and faculty members to save energy [15]. Universities in the U.S. have taken actions on energy policies and movement for green campus development.

#### *2.3. Climate Change*

Recently, more studies have aimed at the future impact on energy consumption owing to climate change. Examples include energy use over long-term climate change for use in life cycle assessment applications with nine typical Florida residential houses [18]. Additionally, Fathi et al. and Fathi and Srinivasan expanded the sample size and targeted a university campus instead of residential buildings, wherein Principal Component Analysis (PCA) and Autoregressive Integrated Moving Average (ARIMA) techniques were selected for energy prediction with climate change [19,20]. Another example is that Godoy-Shimizu et al. estimated urban-scale energy consumption using physics-based models for future weather [21].

Fumo and Rafe Biswas (2015) predicted total ELC using only outdoor dry-bulb temperature and added the global horizontal radiation to produce MLR models without considering building characteristics [22]. Higher resolution for the time interval, which was hourly data, leads to models with a lower quality than regression models using daily data. All three regression models (simple linear, simple quadratic, and MLR) with daily resolution had better R-square values but lower accuracies based on root mean square error (RMSE) than with hourly resolution.

Mohammadiziazi and Bilec used random forest (RF) to analyze office buildings' EUI due to climate change [23]. They found that energy consumption will increase between 8.9% and 63.1% compared to the 2012 baseline for different geographic regions between 2030 and 2080. Campagna and Fiorito found that 65% of studies on climate change impacts on building energy consumption focused on climate zone C [24]. Therefore, studying climate zone A in this study can contribute to mitigating biased sample selection.

#### **3. Methods**

In this study, hourly energy consumption data, consisting of ELC and STM, were obtained. To answer the research questions, the BEMs' development process follows seven steps, namely: (1) set data frame with processing, (2) conduct descriptive statistical analysis, (3) select variables including relative importance analysis, (4) develop multivariate regression models, (5) develop multiple linear regression (MLR) models, (6) estimate energy use of a university campus with future weather data, and (7) analyze university campus' action plans (Figure 1).

**Figure 1.** Conceptual framework.

Step 1 is setting data frame with processing. Four separate datasets (energy consumption variables, building variables, meteorological variables, and temporal variables) were merged to form one complete dataset. After collecting and cleaning all the data, the merged data was divided into a training and a testing dataset. The training dataset is from 1 July 2015 to 30 June 2016. For validation, a 10-fold cross-validation was conducted. The testing dataset is from 1 July 2016 to 14 August 2016.
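As a rough illustration of Step 1, the date-based split and the 10-fold partition could be sketched as follows. The record layout (`(timestamp, value)` tuples), the function names, and the use of contiguous, unshuffled folds are assumptions made for this sketch; the text does not state how the folds were formed.

```python
from datetime import datetime

TRAIN_START = datetime(2015, 7, 1)
TRAIN_END = datetime(2016, 7, 1)   # exclusive bound: training runs through 30 June 2016
TEST_END = datetime(2016, 8, 15)   # exclusive bound: testing runs through 14 August 2016


def split_by_date(records):
    """Split (timestamp, value) records into training and testing lists."""
    train = [r for r in records if TRAIN_START <= r[0] < TRAIN_END]
    test = [r for r in records if TRAIN_END <= r[0] < TEST_END]
    return train, test


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx) pairs for k contiguous folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size
```

Each fold serves once as the validation set while the remaining nine are used for fitting; the held-out July to August 2016 records are touched only for the final test.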

Step 2 is conducting a descriptive statistical analysis. To observe the distribution, density plots, boxplots, histograms, skewness, and kurtosis were used. To understand the data and to detect outliers, scatter plots, three-dimensional (3D) plots, relative importance, correlation coefficient (CC), partial CC, and CC matrix plots were used. Additionally, central tendency (mean, mode, and median), measures of variability (range and interquartile range), variance, and standard deviation were measured.
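The numeric summaries listed in Step 2 can be computed with the Python standard library, as in the minimal sketch below. The function name and the choice of moment-based (population) skewness and excess kurtosis are assumptions; the text does not say which estimators were used.

```python
import statistics


def describe(values):
    """Return central tendency, variability, and shape measures for a sample."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation, used for the moments
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    # Moment-based skewness and excess kurtosis (population form).
    skew = sum((v - mean) ** 3 for v in values) / (n * sd ** 3)
    kurt = sum((v - mean) ** 4 for v in values) / (n * sd ** 4) - 3.0
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.variance(values),  # sample variance
        "stdev": statistics.stdev(values),        # sample standard deviation
        "skewness": skew,
        "excess_kurtosis": kurt,
    }
```

A positive skewness indicates a long right tail, which is a common reason to inspect a variable for outliers before fitting a regression model.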

Step 3 is selecting variables, including importance analysis. The multivariate regression method was used to determine independent variables because all buildings share the same independent variables. Based on the literature and on advice from experts in machine learning and statistics, different variable sets for multivariate regression models were used and compared with each other based on R² and RMSE. Four sets of multivariate regression models were developed, consisting of two linear regression models for ELC and STM. For reliable predictions, a large amount of historical data is required. Thus, the hourly data of 18 buildings with 22 input (independent) variables and 2 output (dependent) variables were observed and used to meet the requirement to develop the multivariate regression models.

The variables were analyzed based on CC, partial CC, and relative importance. To check correlations between variables, three CC measuring methods were used: (1) Pearson CC with the linear dependence between two numeric variables, (2) Spearman for polynomial relationship, and (3) Kendall between categorical input variables and numeric output variables. Kendall's tau and Spearman's rho were used to estimate a rank-based measure of association. To examine the categorical variables, we referred to relative importance as well. For relative importance, rank bootstrap confidence intervals were obtained by using the percentile method. Bootstraps were replicated 100 times in order to calculate confidence intervals. Metrics for relative importance are normalized to sum to 100%. This was used for all numeric variables and meteorological variables alone to examine the influence of variables on energy consumption. Lastly, partial CCs were measured to select meaningful variables.
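For reference, the three correlation coefficients named above can be written out in plain Python as follows. This is a sketch of the textbook definitions (Spearman as the Pearson correlation of average ranks, Kendall in its tie-adjusted tau-b form), not the authors' code; which Kendall variant the study used is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation: linear dependence between two numeric variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b: concordant minus discordant pairs, adjusted for ties."""
    conc = disc = tie_x = tie_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: counted in neither term
            if dx == 0:
                tie_x += 1
            elif dy == 0:
                tie_y += 1
            elif dx * dy > 0:
                conc += 1
            else:
                disc += 1
    return (conc - disc) / math.sqrt((conc + disc + tie_x) * (conc + disc + tie_y))
```

For a monotonic but nonlinear relationship, such as `y = x**2` on positive `x`, both rank-based measures equal 1 while Pearson's coefficient stays below 1, which is why the rank-based measures complement the linear one.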

Step 4 is developing multivariate regression models. From MLR development, variables were selected to finalize the multivariate regression model. As shown in Equation (1), independent numeric variables were standardized with a mean (μ) of 0 and standard deviation (σ) of 1 (unit variance):

$$
\lambda' = (\lambda - \mu) / \sigma \tag{1}
$$
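Equation (1) is the usual z-score transformation. A minimal sketch follows, assuming the population standard deviation; the text does not say whether the population or sample form was used, and the function name is illustrative.

```python
import statistics


def standardize(values):
    """Apply Equation (1): subtract the mean, then divide by the standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population form (assumption; sample form is also plausible)
    return [(v - mu) / sigma for v in values]


z = standardize([10.0, 20.0, 30.0, 40.0])
```

After this transformation, regression coefficients of variables measured in different units (for example W/ft² and °F) become comparable in magnitude.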

Step 5 is developing the MLR BEMs. Stepwise selection was used as the feature selection method. The performance was measured through Mean Absolute Error (MAE), RMSE, and adjusted R-squared (R²). MAE is a key performance indicator (KPI) for measuring forecast accuracy. These measures are used to check the regression models' accuracy in validation and testing and can be calculated using Equations (2)–(4):

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_{predict,i} - y_{data,i} \right| \tag{2}$$

$$\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} \left( y_{predict,i} - y_{data,i} \right)^{2}}{n}} \tag{3}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_{predict,i} - y_{data,i} \right)^2}{\sum_{i=1}^{n} \left( y_{data,i} - \bar{y}_{data} \right)^2} \tag{4}$$

where $\bar{y}_{data}$ is the mean of the observed values.

According to Hoff and Perez, MAE is commonly accepted as a measure of dispersion because of its lesser sensitivity to distant outliers and lesser subjection to interpretation when expressed in relative (percentage) terms [25]. A 10-fold CV was used to validate BEMs and, to test the models, regression models' accuracies were checked with a validation dataset. Based on MLR models, significant variables for each energy consumption type are revealed.
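The three accuracy measures can be computed directly from paired predictions and observations, as in the sketch below. R² is computed against the mean of the observed values, which is the standard definition; the adjusted form uses the usual correction for the number of predictors. Function and argument names are illustrative.

```python
import math


def mae(y_pred, y_true):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)


def rmse(y_pred, y_true):
    """Root mean square error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true))


def r_squared(y_pred, y_true):
    """Coefficient of determination relative to the mean of the observations."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(y_pred, y_true, n_predictors):
    """R-squared penalized for the number of predictors in the model."""
    n = len(y_true)
    r2 = r_squared(y_pred, y_true)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
```

Because RMSE squares each error before averaging, a single large miss raises RMSE more than MAE, which matches the remark above about MAE being less sensitive to distant outliers.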

The sixth step is to estimate the energy use of university campuses with future weather data. To estimate the operational energy consumption under long-term climate change, the average values from the hottest scenario, representing 2054, were used as inputs into the MLR models. Likewise, the average values of building characteristics were used to predict energy consumption in 2054. Lastly, Step 7 is to analyze university campuses' action plans and draw the lessons explained in Section 5.6. Specifically, in this research, the energy action plans of 18 universities were analyzed to show how a leading energy group deals with climate change.

#### **4. Variable Selection**

From Step 1 to Step 3, potential independent variables were scrutinized to build BEMs to explain energy consumptions. This section demonstrates details about the variable selection process, which includes a literature review and the analysis of results.

It should be noted that Kikumoto et al. did not include building characteristics in predicting heat loads [13]. In contrast, Im et al. concentrated on interpreting the regression models to explain the relationships among building characteristics, energy consumption, and weather [26,27]. Im et al. (2019) developed polynomial regression models of chilled water (CHW) and ELC consumption of campus buildings by comparing BEMs with hourly and daily data [24]. Additionally, Im et al. (2020) attempted to predict CHW with lasso regression models using future weather data [25]. However, neither study used solar radiation data, which is considered a significant factor in BEM [28]. In conclusion, this study focused on variable selection and the interpretation of BEMs with larger sample sizes, as well as additional building characteristics and weather data, including solar radiation. Considering ELC and STM allows us to grasp campus energy consumption comprehensively.

Out of 194 buildings, 48 buildings had one year of consumption data for both energy types, ELC and STM. Out of these 48 buildings, the data of 18 buildings were used, after excluding buildings with missing building characteristics data, to develop BEMs by adopting multivariate regression models with a 14.5-month period. The multivariate regression models are intended to provide a comprehensive understanding of energy use by energy type by comparing the variables' impact. The sample size could be larger when the final regression models were developed separately: 43 buildings for ELC and 32 buildings for STM (Table 1). From these separate data frames, building types with fewer than three buildings were eliminated because of the small number of buildings. As a result, all food and health-related buildings were eliminated. Eventually, 39 buildings for ELC and 30 buildings for STM were secured for multiple regression model analysis with the same time frame.
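The rule of dropping building types with fewer than three buildings can be expressed as a simple count-and-filter step. The sketch below uses a hypothetical list of building identifiers and types, not the campus data; only the threshold of three comes from the text.

```python
from collections import Counter

MIN_BUILDINGS_PER_TYPE = 3  # threshold stated in the text


def drop_small_types(buildings):
    """Keep only buildings whose type has at least MIN_BUILDINGS_PER_TYPE members."""
    counts = Counter(b_type for _, b_type in buildings)
    return [(name, b_type) for name, b_type in buildings
            if counts[b_type] >= MIN_BUILDINGS_PER_TYPE]


# Hypothetical sample, for illustration only.
sample = [
    ("B01", "laboratory"), ("B02", "laboratory"), ("B03", "laboratory"),
    ("B04", "office"), ("B05", "office"), ("B06", "office"),
    ("B07", "food"), ("B08", "health"), ("B09", "health"),
]
kept = drop_small_types(sample)
```

In this made-up sample the single food building and the two health buildings are removed, mirroring the outcome described above.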


**Table 1.** Building number by energy types and building types.

#### *4.1. Energy Consumption Variables*

Energy consumption variables comprised electricity consumption (ELC, kilo-British Thermal Unit (kBTU)/Gross Square Feet (GSF)) and steam consumption (STM, kBTU/GSF). GSF is highly related to several variables that have a value per unit square feet, such as U-values, EPD, and LPD. Therefore, GSF was included as the denominator of the dependent variable by dividing the energy consumption data in kBTU by GSF. Equipment, lighting, and plug loads are the three main categories contributing to energy consumption in buildings. Equipment, such as heating, ventilation, and air conditioning (HVAC) systems and water heaters, is a main internal load that contributes to STM in this case study. Lighting and plug loads contribute to ELC. Plug loads are growing as electronics become more pervasive with the ever-accelerating progress of technology, even in construction, which used to be known as a conservative field.
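The normalization of the dependent variable described above amounts to a unit conversion followed by a division by floor area. In the sketch below, the assumption that raw electricity readings arrive in kWh is ours; the conversion factor (1 kWh is approximately 3.412 kBTU) is standard, and the function names are illustrative.

```python
KBTU_PER_KWH = 3.412  # 1 kWh is approximately 3.412 kBTU


def kwh_to_kbtu(kwh):
    """Convert electrical energy from kWh to kBTU."""
    return kwh * KBTU_PER_KWH


def energy_per_gsf(energy_kbtu, gross_square_feet):
    """Return energy consumption normalized by floor area, in kBTU/GSF."""
    if gross_square_feet <= 0:
        raise ValueError("gross_square_feet must be positive")
    return energy_kbtu / gross_square_feet
```

Dividing by GSF keeps the dependent variable on the same per-square-foot basis as inputs such as EPD and LPD, so a large building does not dominate the fit merely because of its size.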

#### *4.2. Building Variables*

Building variables comprised building thermo-physical properties and power densities. Building variables included the U-value of Wall (U-Wall, Btu/(h·°F·ft²)), U-value of Windows (U-Windows, Btu/(h·°F·ft²)), U-value of Roof (U-Roof, Btu/(h·°F·ft²)), Window-Wall Ratio (WWR), building height (feet), construction year (year), Building Age (year), Renovation Age (year), equipment power density (EPD, W/ft²), and Lighting Power Density (LPD, W/ft²).

Based on the regression and the plot, the beta coefficient (β: slope of the regression models) shows that ELC consumption and Building Age are statistically, negatively related (Figure 2). The initial plot with Building Age could lead to the misinterpretation of energy consumption, which indicates that buildings consume less energy as time goes by, regardless of the deterioration of the building. Thus, it is important to use construction year instead of building age. As shown in Figure 3, ELC consumption is higher in newer buildings than in older buildings. This may be because buildings are evolving with advanced technology and more outlets. Newer buildings have more opportunity (more outlets or electronic appliances) of energy use compared to older buildings.

**Figure 2.** Scatter plot of electricity consumption and Building Age.

**Figure 3.** Scatter plot of electricity consumption and construction year.

Gao et al. selected four variables (wall area, building height, building orientation, and glazing area) through the feature reduction process to improve the performance of models out of eight independent variables (relative compactness, surface area, wall area, roof area, building height, orientation, glazing area, and glazing area distribution of a residential building) [29]. Compared to the study by Gao et al., in this study, window wall ratios (WWR) were used, which is a combination of wall area and glazing. Building orientation was not used.

Godoy-Shimizu (2018) mentioned that several studies considered building height to estimate building energy performance [18]. Additionally, a survey of the literature showed that no previous studies used building height and the number of floors at the same time. Although Capozzoli et al. considered both building height and the number of floors to predict heating energy consumption in schools, both variables had too low a Pearson CC and were therefore excluded from the MLR model [30].

Im et al. (2019) used the number of floors, the building height, and their interaction term in previous research and found them significant based on the *p*-value [21]. However, the number of floors was removed from the regression model due to high multicollinearity (Pearson CC: 0.9). The building height improved the regression models more than the number of floors did. Using both building height and the number of floors may be worth researching for commercial buildings because of their variety of space and height. Lastly, the year of building renovation has never been used for building energy prediction. Furthermore, Construction Year was used as an input of the BEM.

#### *4.3. Meteorological Variables*

Meteorological variables included solar radiation, outdoor air temperature, heating degree (HD), cooling degree (CD), relative humidity, pressure, and wind speed. The future meteorological data were obtained as open-source data from the Slipstream Group, a nonprofit energy and weather research organization [25]. They developed the weather scenarios for a representative location in each ASHRAE climate zone with their proprietary algorithm. This algorithm uses raw climate data for future weather from the NARCCAP (North American Regional Climate Change Assessment Program). Future weather scenarios for Pennsylvania (PA) were unavailable; however, scenarios were available for Baltimore, Maryland (MD), which is in the same climate zone 4 and moist (A) category as PA. The predicted Maryland weather data comprise three future weather scenarios: coldest (2047), average, and hottest (2054). The hottest weather scenario for MD was chosen to consider the largest impact of global warming.

The preliminary studies by Im et al. used two independent meteorological variables, (1) outdoor air temperature and (2) relative humidity, representing the years 2015, 2016, and 2054 [26,27]. Common factors were selected among 15 historical meteorological variables and 11 future meteorological variables. There were four climate variables in common: temperature, humidity, pressure, and wind speed. Fathi and Srinivasan (2019) used temperature, solar radiation, and humidity as meteorological factors [20]. Daut et al. (2012) also revealed a strong linear relationship between solar radiation and surface temperature [26]. Therefore, solar radiation was added to the four climate variables to form the final list in this study. HD and CD were generated from temperature with the standard setpoint of 18.33 °C (65 °F). Because of the high multicollinearity, temperature and HD/CD cannot be used together.
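The derivation of HD and CD from temperature can be illustrated with a short Python sketch. It assumes the common definition in which HD is the shortfall of the outdoor temperature below the 65 °F base and CD is the excess above it; the text does not spell out the formula, so the function name `degree_values` and the sample readings are hypothetical.

```python
BASE_TEMP_F = 65.0  # setpoint from the text (65 °F, about 18.33 °C)

def degree_values(temp_f, base_f=BASE_TEMP_F):
    """Return (heating degree, cooling degree) for one temperature reading.

    Assumed definition: HD = max(0, base - T) and CD = max(0, T - base).
    """
    hd = max(0.0, base_f - temp_f)
    cd = max(0.0, temp_f - base_f)
    return hd, cd

# Hypothetical hourly outdoor air temperatures in °F.
hourly_temps_f = [30.0, 65.0, 80.0]
for temp in hourly_temps_f:
    hd, cd = degree_values(temp)
    print(f"T = {temp:5.1f} °F  HD = {hd:5.1f}  CD = {cd:5.1f}")
```

Under this definition, CD minus HD always equals the temperature minus the base, so a linear model that contains temperature, HD, and CD together would have exactly collinear predictors, which is consistent with the multicollinearity noted above.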

#### *4.4. Temporal Variables*

Temporal variables included the number of the week throughout the year (week's number), type of day (weekday, Saturday, or Sunday), numeric hour, categorical hour (0–23), hour type (working, evening, or night), business day type (business day or non-business day), and season type (spring, summer, fall, or winter).

According to Wang and Srinivasan, some researchers utilized occupancy indicators such as the time of day and day type [31]. For example, Dong et al. used the time of the day as a categorical variable to predict building energy consumption [28]. Dong et al. and Kotchen used the month of the year as a categorical variable [32,33]. Based on literature review and an expert's opinion in Information Systems and Operations Management, in this study, temporal variables used as occupancy indicators are type of day, numeric hour, categorical hour, hour type, and business day type. These temporal variables allow for observation of the occupancy condition and pattern by remedying the missing occupancy information. Boiron et al. used the month, day of the week, and hour to develop a regression model [34]. This can be effective for observing behavior patterns at residential buildings, but it does not reflect change over time despite slight improvements in R2 of campus data. Additionally, there is less of an opportunity to change human behavior in the campus setting than to change the technology and machinery being used by the buildings. Season type was used instead of the month of the year in this study to consider the campus's characteristics. As a result, data were grouped into four seasons based on the academic calendar.

Both ELC and STM are consumed by occupants, so the presence of users makes a difference in energy consumption. Therefore, holidays and non-holidays were separated as categorical variables. In addition, electricity is consumed for uses other than cooling and heating, such as appliances, electronics, and lab equipment.

#### **5. Results**

#### *5.1. Process of Variable Selection (Multivariate Regression Model Development)*

Multivariate regression models were developed with 177,408 observations to estimate the ELC and STM consumption (Table 2). All numeric predictors were normalized to compare the β coefficient regardless of the unit of the variables. These models were developed with four sets of different variables. Variables in Test 1 consist of 15 numeric variables and one nominal variable. Variables in Test 2 consist of 15 numeric variables, a nominal variable, and one ordinal variable. Variables in Test 3 consist of 15 numeric variables and one nominal variable. Variables in Test 4 consist of 18 numeric variables including three interaction terms, two nominal variables, and two ordinal variables. R2 and RMSE in Table 2 are averaged values of two models for ELC and STM. Detailed information with β coefficients is shown in Appendix A.


**Table 2.** Different Variables Sets Input in Multivariate regression models.
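As a minimal sketch of why normalizing the numeric predictors makes β coefficients comparable regardless of units, the Python snippet below assumes z-score standardization (the text does not state which normalization was used) and a single predictor. The data values and the helper names `zscore` and `slope` are hypothetical.

```python
import statistics

def zscore(values):
    """Standardize values to mean 0 and standard deviation 1 (assumed normalization)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def slope(x, y):
    """Ordinary least squares slope of y on a single predictor x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical data: equipment power density and energy use.
epd_w_per_ft2 = [0.5, 1.0, 1.5, 2.0, 2.5]
energy = [10.0, 14.0, 15.0, 21.0, 24.0]

raw_beta = slope(epd_w_per_ft2, energy)                  # depends on the unit of EPD
std_beta = slope(zscore(epd_w_per_ft2), zscore(energy))  # unit-free

# Converting EPD from W/ft2 to W/m2 changes the raw slope but not the standardized one.
epd_w_per_m2 = [v * 10.764 for v in epd_w_per_ft2]
raw_beta_m2 = slope(epd_w_per_m2, energy)
std_beta_m2 = slope(zscore(epd_w_per_m2), zscore(energy))

print("raw slopes:", raw_beta, raw_beta_m2)
print("standardized slopes:", std_beta, std_beta_m2)
```

With one predictor the standardized slope equals the Pearson correlation; with several predictors, as in the models of this study, the standardized coefficients would come from a multivariate fit instead.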

There was an overlap between time variables and weather information, which reflects seasonal changes. Changes in meteorological variables, construction year, and Renovation Age are reflected by time. Therefore, year, month of the year, and day were excluded from the regression analysis. Adding Categorical Hour and Week's Number to Test 2, instead of only the Date as in Test 1, improved the ELC and STM models' accuracy based on RMSE.

Week's number was used in Test 2 and Test 3 to observe the chronological order. Because of the continuous cyclical nature of the hour, the numeric hour was used in Test 3 and transformed with sine and cosine. Then, in both cases (ELC and STM), the models' accuracy and R<sup>2</sup> improved. As a final step, new variables were added based on feedback from an expert in statistics, and some variables were eliminated. On top of three interaction terms (X22: X1U-Window × X4WWR, X23: X6-1Construction Year × X6-3Renov., X24: X1U-Wall × X4WWR), the four new variables are construction year (instead of building age), hour type, building type, and business day type. Business day type was used to account for holidays instead of the type of day. Temperature was used initially in Test 1 through Test 3, but R<sup>2</sup> and RMSE showed improvement in terms of the statistical model's accuracy and explanation of energy consumption when HD and CD were considered instead of temperature. The last change in variables enhanced both models significantly. For the ELC MLR model, the RMSE increased, indicating lower accuracy, but R<sup>2</sup> improved.

Based on the result of Test 2, among temporal variables, the hour variable was analyzed as a categorical variable. In the MLR model for ELC, 6 am–10 pm showed a significant *p*-value, while 11 pm–5 am, the nighttime when people were asleep, was insignificant. For STM, the days of the week and 4 am–8 pm were significant. The rest of the hour variables were not. Descriptive data analysis was performed to capture the behavior of energy consumption over time. The hour variable was insignificant for STM, but a predictable pattern was observed in ELC (continuous increase from 3 am to 2 pm and continuous decrease during the rest of the time). Therefore, the hour variable was retained by adopting sine and cosine as a periodic factor.
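The sine and cosine transformation of the numeric hour can be sketched in Python as follows. This is an illustrative encoding with a 24-hour period; the function names `encode_hour` and `cyclic_distance` are hypothetical and not taken from the study.

```python
import math

def encode_hour(hour, period=24):
    """Map an hour of the day onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * (hour % period) / period
    return math.sin(angle), math.cos(angle)

def cyclic_distance(h1, h2):
    """Euclidean distance between two hours in the (sin, cos) space."""
    s1, c1 = encode_hour(h1)
    s2, c2 = encode_hour(h2)
    return math.hypot(s1 - s2, c1 - c2)

# With the raw numeric hour, 23 and 0 are 23 units apart;
# after the transformation they are as close as 0 and 1.
print(cyclic_distance(23, 0), cyclic_distance(0, 1))
print(encode_hour(6))   # close to (1, 0)
print(encode_hour(18))  # close to (-1, 0)
```

Both components are needed as predictors: the sine alone cannot distinguish, for example, hour 0 from hour 12.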

In conclusion, ELC usage had the longest hour timeframe that showed a meaningful difference from other times, followed by STM. According to the 2D plots, only ELC had an hourly consumption pattern as a cycle of the day, particularly owing to occupancy behavior rather than weather. Therefore, the hour variable needed to be considered for the ELC model, rather than for STM, based on the findings of the multivariate regression model and descriptive analysis. Additionally, week number was insignificant for ELC. Thus, this study used multivariate analysis to compare the impact of variables for each type of energy consumption in order to grasp campus energy consumption comprehensively.

A lower U-value indicates better insulation and therefore lower energy consumption. Similar reasoning applies to WWR, because a high ratio of window to wall area can increase the chance of infiltration, leading to more cooling in the summer or heating in the winter season. Therefore, positive relationships are expected between energy consumption and the U-values of the wall, window, and roof, as well as between WWR and energy consumption. Final multivariate regression models were developed with 177,408 observations and 19 predictors with 3 interaction terms, consisting of 3 MLR models for ELC and STM (Table 3).


**Table 3.** Building number by energy types and building types.

Notes: Correlation is significant at different levels (2-tailed) as follows: \*\* 0.001 level and \* 0.01 level.

#### *5.2. Comparison of Multivariate Regression Model by Energy Type*

Among building variables, there were three outstanding β coefficients in the multivariate regression model (Table 4). ELC had a higher β coefficient with EPD (1.58) than STM (1.24). STM had a positive β coefficient with U-Window (0.61), opposite to ELC (−0.81). STM had a higher β coefficient with building height (0.751) than ELC (0.349).


**Table 4.** Multivariate regression model (18 buildings) without standardization.

Among meteorological variables, solar radiation and wind speed were insignificant for ELC, considering the 0.05 level of *p*-values. Additionally, weekdays (Monday to Friday) were significant, but differentiating Saturday and Sunday was not meaningful. Therefore, for ELC, categorizing the day of the week into business day or non-business day is better than using the type of day. Both ELC and STM have negative β coefficients with temperature, which are −0.049 and −1.729, respectively.

#### *5.3. Result of MLR*

Cross-validation was conducted for validation, and Table 5 shows the MLR models' results. The MLR model could explain 76.74% of the ELC consumption and 56.59% of the STM consumption based on R<sup>2</sup>. The R<sup>2</sup> of ELC was the highest among the three energy consumption types for validation. Additionally, MAE, RMSE, and R<sup>2</sup> were measured in the accuracy test of the regression models, as shown in Table 6. ELC and STM had similar values of MAE (0.00 and 0.01) and RMSE (2.62 and 1.62). The R<sup>2</sup> values for model accuracy were 88.28 for ELC and 74.47 for STM.


**Table 5.** Multiple linear regression.

**Table 6.** BEMs' accuracy test result of MLR.
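For reference, the three accuracy metrics reported above can be computed as in the following Python sketch. It uses the standard definitions of MAE, RMSE, and R<sup>2</sup>; the function name `regression_metrics` and the sample values are hypothetical and unrelated to the campus data.

```python
import math

def regression_metrics(actual, predicted):
    """Return (MAE, RMSE, R^2) for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Hypothetical observed and predicted consumption values.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
mae, rmse, r2 = regression_metrics(observed, predicted)
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}, R2 = {r2:.2f}")
```

MAE and RMSE carry the unit of the response variable, whereas R<sup>2</sup> is unit-free, so comparisons between ELC and STM on MAE or RMSE depend on how each response was scaled.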


#### *5.4. Relative Importance Analytics*

Relative importance with bootstrap confidence intervals (%) was analyzed with a one-year training dataset (177,408 observations), and metrics were normalized to sum to 100% (Figure 4).

**Figure 4.** Relative importance of all independent variables with 95% bootstrap confidence intervals: (**a**) electricity (ELC) and (**b**) steam (STM).
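To make the idea of normalized relative importance with bootstrap confidence intervals concrete, the Python sketch below works through a two-predictor case. It assumes the LMG decomposition of R<sup>2</sup> (averaging each predictor's contribution over both orders of entry) and percentile bootstrap intervals; the text does not name the metric or the software used, so both are assumptions, and the synthetic data and function names are hypothetical.

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def lmg_shares(x1, x2, y):
    """LMG shares of R^2 for two predictors, normalized to sum to 1."""
    r1, r2, r12 = pearson(x1, y), pearson(x2, y), pearson(x1, x2)
    # R^2 of the two-predictor linear model, written in terms of correlations.
    r2_full = (r1 * r1 + r2 * r2 - 2 * r1 * r2 * r12) / (1 - r12 * r12)
    # Average the contribution of each predictor over both orders of entry.
    lmg1 = 0.5 * (r1 * r1 + (r2_full - r2 * r2))
    lmg2 = 0.5 * (r2 * r2 + (r2_full - r1 * r1))
    return lmg1 / r2_full, lmg2 / r2_full

def bootstrap_ci(x1, x2, y, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap interval for the normalized share of x1."""
    rng = random.Random(seed)
    n = len(y)
    shares = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        share1, _unused = lmg_shares([x1[i] for i in idx],
                                     [x2[i] for i in idx],
                                     [y[i] for i in idx])
        shares.append(share1)
    shares.sort()
    low = shares[int((1 - level) / 2 * n_boot)]
    high = shares[int((1 + level) / 2 * n_boot) - 1]
    return low, high

# Synthetic data: the response depends strongly on hd and weakly on epd.
data_rng = random.Random(1)
n_obs = 200
hd = [data_rng.gauss(0, 1) for _ in range(n_obs)]
epd = [data_rng.gauss(0, 1) for _ in range(n_obs)]
stm = [3 * h + e + data_rng.gauss(0, 1) for h, e in zip(hd, epd)]

share_hd, share_epd = lmg_shares(hd, epd, stm)
ci_low, ci_high = bootstrap_ci(hd, epd, stm)
print(f"share of hd: {100 * share_hd:.1f}% "
      f"(95% CI {100 * ci_low:.1f}% to {100 * ci_high:.1f}%)")
print(f"share of epd: {100 * share_epd:.1f}%")
```

With more than two predictors the LMG average runs over all orderings, which is why dedicated statistical packages are normally used for models of the size described here.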

Meteorological variables explained 0.06% (relative importance: 0.36%) for ELC and 4.36% (relative importance: 20.51%) for STM, as revealed from the R<sup>2</sup> of the multivariate regression models. Solar radiation is the most related weather variable for ELC, although its relative contribution is only 0.19%. The relative contribution of temperature for ELC is 0.12%, which consists of CD (0.11%) and HD (0.01%). The relationship between outdoor temperature and heating/cooling is obvious, but the relative importance analysis confirms the contribution of temperature to energy consumption. Based on the relative importance analysis, HD is the most crucial predictor of STM. CD is the next most crucial meteorological variable, but other variables rank ahead of CD. When only weather variables were considered, relative contributions with 95% confidence intervals for HD are between 82.64% and 85.16% for STM. Relative contributions with 95% confidence intervals for CD are between 82.64% and 85.16% for STM. With all other variables included, the relative contribution to STM is 15.26% for HD and 3.07% for CD. Universities need to be aware of climate change because temperature-related variables are the top variables for STM. According to CD's rank for STM, only HD was a significantly important variable for STM consumption, and a temperature below 65 °F did not affect the heating significantly.

Proportions of variance explained by the models were the same as the adjusted R<sup>2</sup> of the multivariate regression models. To answer RQ1, the top four variables that accounted for more than 50% of the relative importance were as follows: EPD, building type, building height, and construction year for ELC; HD, EPD, building height, and building type for STM. EPD, building type, and building height are common to both ELC and STM. Therefore, universities should choose energy-efficient EPD and decide on a reasonable building height in the design phase to save ELC and STM consumption. Additionally, building energy should be analyzed by building type because the laboratory building type showed totally different patterns compared to other building types.

To be specific, energy consumption patterns by building type were analyzed as follows. Through descriptive statistical analysis, a rapid increase in ELC consumption in laboratory and office buildings was observed. One cause may be the increasing demand for computing work in laboratory and office buildings, because this trend is not present in lodging, education, and public assembly buildings. Laboratories are the most energy-consuming building type. The laboratory building type has a larger variance in STM consumption than other building types, and laboratories require much higher STM consumption, as a regression model shows (Figure 5). This finding aligns with the findings of Ferguson et al., who stated that priority should be given to buildings with high energy demands, such as research in university settings [35]. Therefore, the university needs to track the cause of energy consumption and replace equipment with energy-efficient alternatives (better EPD) or improve the insulation of building components for laboratory buildings. According to the regression model with temperature and STM consumption, laboratory buildings had the steepest slope, and office buildings had the second steepest slope. The steep slope means that more STM is consumed when the temperature is higher. Evaluating laboratory hoods is suggested to universities to improve energy efficiency and user operations, or to arrange for removing unneeded equipment [35]. Washington University in St. Louis implemented low-flow fume hoods with hood occupancy controls, which led to a 40% reduction in energy use [36]. Purchasing Energy Star equipment for offices and laboratories can be challenging for individual faculty, so universities may promote it financially or provide an endorsement of the products.

**Figure 5.** Scatter plot of STM consumption by temperature.

The education building type consumes STM with the smallest variance and the fewest outliers compared to other building types. The small variance means that estimating STM of the lodging building type is easier than for other building types because the data are clustered, which causes less error in prediction models.

#### *5.5. Result of Correlation Coefficient (CC) and the Partial CC*

Both CC and the partial CC were considered for analysis. A partial CC was run to determine the relationship between an individual variable and each energy consumption while controlling for the rest of numeric variables. For ELC, EPD has CC (0.505) and partial CC (0.458). U-Roof has CC (0.438) and partial CC (0.358), respectively. The relationships between EPD and ELC and between U-Roof and ELC show a direct relationship with little difference when comparing the CC and the partial CC.

For STM, the CC and partial CC of building height (0.521; 0.130), construction year (−0.429; −0.115), and EPD (0.546; 0.157) indicate that controlling for the rest of the variables had a very large influence on the relationship between these three variables and STM. In other words, these three variables are closely related to other variables, so their direct relation with STM is relatively weak. There was a moderate, positive CC between HD and STM. All the results of CC, partial CC, and relative importance indicated that temperature is a significant variable for STM and has a direct relation with STM.
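The gap between a CC and its partial CC can be illustrated with a small Python sketch. For brevity it controls for a single variable using the first-order partial correlation formula, whereas the analysis above controls for all remaining numeric variables; the data values and variable names are hypothetical.

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z from both."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical values: a control variable drives both the predictor and the response.
control = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
predictor = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5]
response = [1.5, 1.5, 2.5, 4.5, 5.5, 5.5]

cc = pearson(predictor, response)
pcc = partial_corr(predictor, response, control)
print(f"CC = {cc:.3f}, partial CC = {pcc:.3f}")
```

In this toy case the raw CC is high while the partial CC is much lower, the same qualitative pattern reported for building height, construction year, and EPD with STM.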

Braun et al. (2014) found that temperature has the highest influence on building energy consumption, with R<sup>2</sup> equal to 0.92 for ELC and R<sup>2</sup> equal to 0.85 for gas consumption [37]. Unlike in the study by Braun et al., temperature was insignificant for ELC in this study, while solar radiation was the most critical weather factor for ELC among meteorological variables. The end use needs to be compared for an appropriate comparison. Based on analyzing the relative importance of the MLR models, the most effective key variables to answer RQ1, regarding the appropriate building- and weather-related variables to be included in the predictive model, are as follows: EPD, building type, building height, and construction year for ELC, and HD, EPD, building height, and building type for STM.

Regression diagnostic plots were created to check linearity, normality of residuals, homogeneity of residual variance, and independence of residual error terms. The null hypothesis of the studentized Breusch–Pagan test (BPtest) was that the residuals have constant variance. Thus, a *p*-value less than 0.05 would mean that the homoscedasticity assumption would have to be rejected. Both final models passed the BPtest. Furthermore, an increase in STM consumption was observed because it is influenced by temperature (Figures 4–6).
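The logic of the studentized Breusch–Pagan test can be sketched in Python for the simplest case of one predictor. The sketch assumes the Koenker form of the statistic (the number of observations times the R<sup>2</sup> of an auxiliary regression of squared residuals on the predictor), compared against a chi-square distribution with one degree of freedom; the data are synthetic and the function names are hypothetical.

```python
import math
import statistics

def ols_fit(x, y):
    """Return (intercept, slope) of a simple least squares regression."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1

def r_squared(x, y):
    """Coefficient of determination of the simple regression of y on x."""
    b0, b1 = ols_fit(x, y)
    my = statistics.fmean(y)
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def breusch_pagan(x, y):
    """Studentized BP statistic and p-value for one predictor (1 degree of freedom)."""
    b0, b1 = ols_fit(x, y)
    sq_resid = [(yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression of squared residuals on x; clamp rounding noise at zero.
    lm = max(0.0, len(x) * r_squared(x, sq_resid))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value

x = [float(i) for i in range(1, 21)]
# Synthetic responses: noise with constant spread versus spread that grows with x.
y_constant = [2 * xi + (-1) ** i * 1.0 for i, xi in enumerate(x, start=1)]
y_growing = [2 * xi + (-1) ** i * 0.2 * xi for i, xi in enumerate(x, start=1)]

lm_hom, p_hom = breusch_pagan(x, y_constant)
lm_het, p_het = breusch_pagan(x, y_growing)
print(f"constant spread: LM = {lm_hom:.3f}, p = {p_hom:.3f}")
print(f"growing spread:  LM = {lm_het:.3f}, p = {p_het:.5f}")
```

A model passes the test when its *p*-value is at least 0.05, as in the constant-spread case here; with several predictors the degrees of freedom equal the number of predictors in the auxiliary regression.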

**Figure 6.** Boxplots for STM prediction by building type in 2047 with coldest scenario and 2054 with hottest scenario.

As observed in the β coefficient analysis, weather variables do not influence ELC consumption when STM and CHW are used as other energy sources. Based on Figure 6, STM consumption will increase with the humid subtropical climate in both the hottest and coldest scenarios. Climate change will contribute to increasing STM for heating as well as countering the effect of global warming, which can easily be neglected.

#### *5.6. Retro-Commissioning of University Campuses*

Universities tend to respond to the challenge of global climate change more effectively as educational institutions compared to other building types. Other building types try to meet the requirements of building energy standards passively and inexpensively when energy policies change. Therefore, checking how universities respond to climate change can be meaningful to glimpse the trajectory for the overall building sector.

When prioritizing and implementing the deferred maintenance program, building operating efficiency and optimization should be fully considered [38]. The monitoring results lead to selecting energy-intensive buildings, which consume more campus energy than other buildings [39]. Based on this selection, one should conduct energy audits of these buildings and retro-commission them. The potential to save energy can differ from the amount of energy consumption. Thus, during the selection process and retro-commissioning, potential factors that can bring significant change in energy saving need to be examined.

Most energy saving information in the energy action plans was a prediction rather than actual energy saving results. These plans tend to go unevaluated, especially when they do not achieve their goal. Universities need to pay attention to the actual results of their action plans and use them as a reference to evolve energy saving strategies.

Virginia Tech found that 35 percent of all buildings (50 buildings) on campus accounted for over 70 percent of overall university energy costs. Thus, they chose the top ten buildings on which to focus energy saving efforts [40]. Boston University reduced its energy consumption by 4% through its energy saving plan over eight years from 2006 while growing in size by 14% [41]. The University of Pennsylvania (UPenn) also upgraded lighting in 45 buildings [42]. At UPenn, carbon dioxide equivalent emissions have been reduced by 32% from STM and 9% from ELC since 2014. Emissions from CHW increased by 17%, but CHW emitted the least carbon dioxide among these three energy types [42].

To save energy, many universities have established energy action plans. Commonly observed action items from 18 universities are shown in Table 7: (1) replacing lights with energy-efficient lights; (2) installing occupancy sensors; (3) equipment reinforcement; (4) upgrading windows; (5) implementing renewable energy systems (e.g., solar panels); (6) renovating roofs as green roofs or insulating roofs; and (7) managing building automation (monitoring system).


**Table 7.** Summary of universities' energy efficiency strategies.
