Article

Multi-Agent Deep Reinforcement Learning-Based HVAC and Electrochromic Window Control Framework

1 School of Electronic Science and Engineering (School of Microelectronics), South China Normal University, Foshan 528225, China
2 School of Artificial Intelligence, South China Normal University, Foshan 528225, China
* Author to whom correspondence should be addressed.
Buildings 2025, 15(17), 3114; https://doi.org/10.3390/buildings15173114
Submission received: 1 August 2025 / Revised: 28 August 2025 / Accepted: 29 August 2025 / Published: 31 August 2025

Abstract

Deep reinforcement learning (DRL)-based HVAC control has shown clear advantages over rule-based and model predictive methods. However, most prior studies remain limited to HVAC-only optimization or simple coordination with operable windows. Such approaches do not adequately address buildings with fixed glazing systems—a common feature in high-rise offices—where the lack of operable windows restricts adaptive envelope interaction. To address this gap, this study proposes a multi-zone control framework that integrates HVAC systems with electrochromic windows (ECWs). The framework leverages the Q-value Mixing (QMIX) algorithm to dynamically coordinate ECW transmittance with HVAC setpoints, aiming to enhance energy efficiency and thermal comfort, particularly in high-consumption buildings such as offices. Its performance is evaluated using EnergyPlus simulations. The results show that the proposed approach reduces HVAC energy use by 19.8% compared to the DQN-based HVAC-only control and by 40.28% relative to conventional rule-based control (RBC). In comparison with leading multi-agent deep reinforcement learning (MADRL) algorithms, including MADQN, VDN, and MAPPO, the framework reduces HVAC energy consumption by 1–5% and maintains a thermal comfort violation rate (TCVR) of less than 1% with an average temperature variation of 0.35 °C. Moreover, the model demonstrates strong generalizability, achieving 16.58–58.12% energy savings across six distinct climatic regions—ranging from tropical (Singapore) to temperate (Beijing)—with up to 48.2% savings observed in Chengdu. Our framework indicates the potential of coordinating HVAC systems with ECWs in simulation, while also identifying limitations that need to be addressed for real-world deployment.

1. Introduction

According to the International Energy Agency (IEA) report, the building and construction sector accounted for 35% of global energy consumption in 2020 [1]. Among public buildings, heating, ventilation, and air conditioning (HVAC) systems are one of the largest energy consumers, accounting for about 60% of total consumption, which corresponds to roughly 12% of global final energy consumption [2,3]. Optimizing energy control strategies in office buildings, as one of the primary types of public buildings, and leveraging building envelope structures to enhance energy efficiency show substantial potential for energy savings and emission reductions [4]. Existing studies have shown that deep reinforcement learning (DRL)-based HVAC control can improve energy efficiency compared to rule-based or model predictive methods [5,6,7], yet most efforts remain limited to HVAC-only optimization or coordination with mechanically operable windows. These approaches do not effectively address buildings equipped with large fixed glazing systems—a common feature in high-rise office buildings—where the lack of operable windows prevents adaptive envelope interaction. Electrochromic windows (ECWs), with their ability to dynamically regulate solar heat gain without mechanical operation, provide a promising alternative, but their integration into multi-zone HVAC control has not been systematically explored. This unresolved gap forms the primary motivation of the present work.
Rule-based control (RBC) is widely adopted in HVAC systems due to its simplicity and ease of implementation. However, RBC strategies are typically static and rely heavily on the empirical knowledge of engineers and facility managers [8]. As HVAC systems and building environments become increasingly complex, the RBC method struggles to adapt to dynamic conditions such as rapid weather variations, and changes in solar heat gain, which collectively lead to significant uncertainty in building thermal loads. Model predictive control (MPC) has demonstrated robust performance across a range of building control scenarios. However, MPC also presents certain limitations. Developing models typically requires a significant amount of time and effort, as the process of creating an accurate model of the building poses challenges for practical deployment [9]. While both RBC and MPC offer distinct advantages, they exhibit limitations in control performance and generalizability across diverse building scenarios.
Deep reinforcement learning (DRL), a novel HVAC control method, possesses characteristics that enable it to address the limitations of RBC and MPC. DRL is capable of evaluating both the short-term and long-term consequences of control decisions, and can adapt to diverse environments and building configurations by learning from simulations or real-world interactions. Its ability to autonomously learn optimal control policies makes it one of the most promising methods for building energy management. Furthermore, DRL-based control methods learn directly from operational data and eliminate the need for complex building and energy system modeling, as required by MPC [10]. Current DRL-based methods primarily focus on improving temperature control algorithms and have demonstrated notable improvements in energy efficiency [5,6,7]. Several studies have extended this concept by incorporating the synergistic control of HVAC systems and building envelopes—such as windows—showing even greater potential for energy saving [11,12,13].
However, most of these DRL-based methods adopt single-agent DRL algorithms, where the state space grows exponentially when multiple zones need to be controlled, leading to the dimensionality explosion problem. Moreover, they primarily focus on mechanically operated traditional windows, offering no effective solutions for buildings equipped with fixed windows.
It is worth noting that methods based on multi-agent deep reinforcement learning (MADRL) are well-suited to address multi-zone control problems involving multiple variables, offering additional potential to reduce energy use and enhance thermal comfort in office environments. This is particularly relevant for office buildings, which often feature large fixed glass curtain walls, where retrofitting with switchable glazing typically necessitates redesigning load-bearing and sealing systems [14]. Hwang et al. [15] noted that the traditional sliding window mechanisms are not well-suited for modern high-rise buildings equipped with glass curtain walls.
ECWs have gained popularity in low-energy and high-performance building designs [16,17,18]. ECWs can reduce energy consumption by adjusting light transmittance without mechanical operation (like opening/closing), which is well suited for architectural structures such as office buildings that use large areas of fixed glass. With the increasing maturity of ECWs, their integration with intelligent HVAC control offers the potential to improve indoor thermal comfort and energy performance in office buildings.

Related Work

DRL-based HVAC control methods have recently garnered significant attention, with numerous studies aiming to enhance temperature regulation methods and minimize HVAC energy consumption while maintaining thermal comfort [19,20]. Azuatalam et al. [5] developed a reinforcement learning agent that achieved up to 22% energy savings in single-zone HVAC control scenarios. Kodama et al. [21] proposed a DRL-based method to concurrently enhance the functioning of HVAC and battery storage systems within a residence. Bereketeab et al. [22] employed the policy-based DRL method PPO-Clip, achieving a 12.6% reduction in heating coil power consumption and a 6.7% decrease in overall HVAC energy consumption. Hu et al. [23] proposed a novel DRL method, GASAC, which increased the duration of acceptable indoor temperatures by 11.43% and decreased energy consumption by 14.05%. However, as the number of controlled systems increases, the action space of these methods grows exponentially, hindering effective exploration of the state–action space and limiting scalability, which highlights the inherent limitations of single-agent reinforcement learning in multi-system settings.
Compared to single-zone control, multi-zone HVAC control is more complex due to inter-zone thermal interactions. Considerable research has explored the use of DRL for multi-zone HVAC control. Deng et al. [24] proposed a non-stationary DQN method that combines proactive environmental change detection with reinforcement learning to adapt to varying building conditions. Their approach outperformed standard DQN in both single-zone and multi-zone cases, reducing energy consumption by 13% and improving comfort by 9%. Wang et al. [25] applied DQN for multi-zone HVAC optimization, demonstrating improvements in energy efficiency and comfort. However, both studies emphasized that scalability remains a critical issue, since the action space grows exponentially with the number of zones. Blad et al. [26] introduced an LSTM-enhanced Q-learning-based multi-agent framework for real-time HVAC optimization, reporting a 19.4% reduction in heating energy use compared to RBC. Zhang et al. [27] developed a BEM-DRL framework using the A3C algorithm in combination with Bayesian optimization and genetic algorithms, which achieved a 16.7% reduction in heating demand relative to RBC. Li et al. [28] explored demand response scheduling using Trust Region Policy Optimization (TRPO) for household appliances, extending DRL to demand-side management.
Other studies investigated continuous action control. Wang et al. [29] applied both DQN and DDPG to multi-zone HVAC systems, showing that the DDPG-based method improved comfort while achieving a 10.06% energy saving compared with RBC. Gao et al. [30] combined GRUs with DRL to capture time-series dynamics, leading to a 14.5% reduction in total energy use and an 88.4% improvement in comfort performance compared to standard DRL. Li et al. [31] applied DDPG to a two-zone system and demonstrated superior performance over DQN, with a 15% gain in energy efficiency and a 79% reduction in comfort violations.
It is noteworthy that the aforementioned studies did not address the dimensionality explosion problem, whereby the global state and action spaces expand exponentially as the number of zones increases when a single agent simultaneously manages multiple HVAC systems. In contrast, MADRL methods alleviate this issue by distributing control across multiple agents. MADRL improves coordination among agents, enabling better overall control performance [32]. For instance, Liang et al. [33] adopted a MADRL approach incorporating an attention mechanism. Shen et al. [34] proposed a multi-agent co-optimization framework that integrates D3QN and DDPG, effectively handling simultaneous control of both continuous and discrete actions by multiple agents. Results showed that the multi-agent cooperative optimization framework reduced thermal discomfort duration by 84.86%. Xue et al. [8] introduced a hybrid GA-MADDPG approach for multi-zone HVAC control, achieving superior energy efficiency and thermal comfort compared to other DRL methods. Li et al. [35] developed a multi-agent thermal control framework, modeling each zone as an independent agent. TRNSYS-based simulations validated the framework’s effectiveness in improving energy efficiency and thermal comfort. Liu et al. [32] proposed a MADRL-based method for multi-zone HVAC control. The method autonomously adjusted temperature setpoints in each zone via agent collaboration, achieving a 51.09% and 4.34% reduction in power costs compared to RBC and single-agent DRL methods, while maintaining thermal comfort across zones.
The aforementioned methods have demonstrated promising outcomes. However, most MADRL-based HVAC control studies aimed at controlling the HVAC system, ignoring the interactions between the HVAC system and other building elements (e.g., windows and smart windows) in improving the building environment.
In an early study, Chen et al. [12] proposed a Q-learning-based DRL method for the joint control of HVAC systems and windows, achieving a 23% reduction in HVAC energy consumption and an 80% decrease in discomfort hours, while also demonstrating effective humidity regulation. Ding et al. [13] introduced the OCTOPUS system, which jointly controlled heating, cooling, and window operations, yielding significant energy savings while maintaining occupant comfort. Compared to RBC used in LEED Gold-certified buildings, the system improved energy efficiency by 14.26% and outperformed recent DRL methods by 8.1%, demonstrating the potential of multi-device collaborative control for building energy optimization. Xin et al. [2] leveraged the ASHRAE Global Occupant Behavior Database to enable coordinated control of HVAC systems and windows. The approach demonstrated generalizability across four climatic regions, with a 24% increase in thermal comfort hours and a 24.7% reduction in HVAC energy consumption. Li et al. [36] proposed a co-simulation framework integrating Building Energy Simulation (BES) and Computational Fluid Dynamics (CFD) with real-time control of HVAC systems and windows. The method achieved a 68.5% improvement in thermal comfort and a 43.5% reduction in daily cooling energy consumption compared to fixed-schedule operation, significantly enhancing real-time control accuracy and enabling co-optimization of comfort and energy efficiency.
ECWs are particularly attractive for low-energy construction applications among all advanced window glazings [16,17]. ECWs can regulate the radiant energy entering a building by changing their light transmittance at low voltage [37]. Sadooghi [38] showed that judicious management of switchable glass can improve energy efficiency in buildings, and that ECW applications reduce reliance on air-conditioning systems. Reynisson [39] compared the energy efficiency of ECWs against conventional windows, both with and without blinds, in many European cities, finding that energy consumption could be reduced by 10–30% relative to windows with operable blinds and by 50–75% compared to windows without blinds, suggesting that ECWs can significantly reduce energy consumption. Oh et al. [40] examined the thermal load performance of ECWs in Korea and determined that ECWs present the lowest heating and cooling loads when compared to alternatives such as blinds and low-emissivity double-glazed units, achieving a 31.4% reduction in total loads relative to the reference window. Most current research has focused on the independent energy-saving effects of ECWs, while the synergistic control mechanisms between HVAC systems and ECWs have not been sufficiently explored. Dussault et al. [41] found that ECWs reduce the peak cooling load of office buildings, potentially creating new optimization opportunities for HVAC load response strategies. However, the thermal comfort and energy performance of integrated HVAC–ECW control strategies remain largely unexamined.
In summary, although existing research on MADRL-based HVAC control and ECW applications demonstrates promising potential, most studies still face two key limitations. First, single-agent reinforcement learning algorithms suffer from the dimensionality explosion problem in multi-zone control scenarios, which severely limits scalability. Second, the synergistic optimization between HVAC systems and fixed windows has been largely overlooked, especially in high-rise office buildings where large areas of non-operable glazing are prevalent. These gaps motivate our work, which proposes a MADRL-based control framework to jointly optimize HVAC and ECW operations.
The proposed framework integrates ECW with adjustable light transmittance into the HVAC optimization, aiming to enhance building energy optimization strategies based on findings and limitations from previous research. The framework is evaluated using modeling simulations conducted in the EnergyPlus environment [42]. The proposed framework incorporates an optimization that effectively balances the agent’s individual and collective objectives, resulting in improved energy efficiency and thermal comfort in the HVAC system. The main contributions to this research are summarized as follows:
1. We propose a multi-zone control framework that couples HVAC systems with ECWs, advancing the conventional HVAC–operable-window synergy to adaptive HVAC–ECW coordination. Designed for high-rise buildings with fixed glazing, the framework jointly optimizes HVAC operation and ECW transmittance for practical smart-building deployment. It addresses a key gap in previous work: the absence of coordinated HVAC–ECW control strategies for buildings dominated by fixed curtain walls and large fixed glazing.
2. We adopt the Q-value Mixing (QMIX) algorithm and exploit its monotonic value factorization to encode inter-zone dependencies, enabling cooperative multi-zone optimization while mitigating the exponential growth of the joint action space as the number of zones increases. Compared with other multi-agent methods (e.g., VDN, MADQN, MAPPO), QMIX better captures inter-zone thermal coupling, thereby achieving joint action combinations that are closer to optimal. It addresses a key gap: a scalable multi-zone controller that explicitly models thermal coupling while remaining tractable as the number of zones grows.
3. The model was evaluated in terms of training efficiency, energy performance, and thermal comfort violation rate, and benchmarked against other multi-agent deep reinforcement learning methods. In addition, performance trade-offs and robustness were systematically analyzed under varying control priorities and climate conditions.

2. System Modeling

This section systematically formulates the thermal comfort criteria and the HVAC–ECW model, establishing a multi-zone coupled control framework grounded in thermodynamic principles. These formulations provide the theoretical foundation for the cooperative optimization framework introduced later.

2.1. Thermal Comfort Standard

To simplify the thermal comfort problem addressed in this study, we define thermal comfort as maintaining the indoor temperature of each zone within a predefined range to ensure occupant comfort [25]. In all experiments conducted in this work, the thermal comfort range is set to 22–24 °C from 6:00 a.m. to 7:00 p.m. on weekdays, and 20–26 °C during the remaining hours. The thermal comfort violation rate (TCVR) is defined as follows:
$$\mathrm{TCVR} = \frac{1}{Z\,T} \sum_{i=1}^{Z} \sum_{t=1}^{T} \mathbb{I}\left( T_{i,t} \notin [T_{\min}, T_{\max}] \right)$$
where
  • $T_{i,t}$ is the temperature of zone i at time step t;
  • Z is the number of zones;
  • T is the total number of time steps;
  • $T_{\min}$ and $T_{\max}$ denote the lower and upper bounds of the thermal comfort range, respectively;
  • $\mathbb{I}(\cdot)$ is the indicator function, which equals 1 when the condition is true and 0 otherwise.
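For illustration, the TCVR metric can be computed directly from a matrix of recorded zone temperatures. The following Python sketch is a minimal example; it assumes a fixed comfort band for brevity, whereas the band in this study varies between working and non-working hours:

import numpy as np

def tcvr(temps, t_min=22.0, t_max=24.0):
    """Thermal comfort violation rate over a (Z, T) temperature matrix."""
    temps = np.asarray(temps)
    violations = (temps < t_min) | (temps > t_max)  # True where comfort is violated
    return violations.mean()

# Two zones, four time steps; one of eight samples violates -> TCVR = 0.125
print(tcvr([[22.5, 23.0, 25.1, 23.4],
            [22.1, 23.9, 22.8, 23.3]]))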

2.2. HVAC-ECW Model

The internal heat balance of a building follows the first law of thermodynamics:
$$Q_{\mathrm{HVAC}} = Q_{\mathrm{win}} + Q_{\mathrm{int}} + Q_{\mathrm{wall}} - Q_{\mathrm{vent}}$$
where
  • $Q_{\mathrm{HVAC}}$ represents the amount of heating or cooling required from the HVAC system to maintain indoor thermal balance under ideal conditions;
  • $Q_{\mathrm{win}}$ denotes solar heat gain through windows;
  • $Q_{\mathrm{int}}$ refers to internal heat gains from equipment and occupants;
  • $Q_{\mathrm{wall}}$ is the conductive heat transfer between walls and the external environment;
  • $Q_{\mathrm{vent}}$ accounts for ventilation and air exchange losses.
The transmission of $Q_{\mathrm{win}}$ and $Q_{\mathrm{wall}}$ is affected by the building's thermal inertia. Due to thermal lag, the actual thermal load may increase before the system can respond to temperature fluctuations. In buildings equipped with ECWs, the solar heat gain $Q_{\mathrm{win}}$ can be quantified as
$$Q_{\mathrm{win}} = A \cdot I_{\mathrm{solar}} \cdot \mathrm{SHGC} \cdot f_{\mathrm{angle}}$$
$$\mathrm{SHGC} = \tau + \alpha \cdot N$$
where
  • A: area of the ECWs;
  • $I_{\mathrm{solar}}$: incident solar radiation intensity;
  • SHGC: solar heat gain coefficient;
  • $f_{\mathrm{angle}}$: angle correction factor for solar incidence;
  • $\tau$: visible light transmittance of the ECWs;
  • $\alpha$: fraction of absorbed solar radiation;
  • N: portion of absorbed heat transferred indoors through convection and radiation.
ECWs reduce HVAC energy consumption by dynamically adjusting $\tau$, thereby regulating $Q_{\mathrm{win}}$. Thus, buildings equipped with ECWs exhibit distinct thermal responses to solar radiation, enabling a transition from passive solar gain to active solar management.
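To make the role of $\tau$ concrete, the sketch below evaluates the solar-gain equations above for a clear and a tinted ECW state. The numerical values are illustrative assumptions, not measured device parameters:

def window_solar_gain(area_m2, i_solar, tau, alpha, n_inward, f_angle=1.0):
    """Q_win = A * I_solar * SHGC * f_angle, with SHGC = tau + alpha * N."""
    shgc = tau + alpha * n_inward
    return area_m2 * i_solar * shgc * f_angle

# Darkening the ECW (lowering tau from 0.60 to 0.10) cuts the instantaneous gain:
clear  = window_solar_gain(10.0, 600.0, tau=0.60, alpha=0.30, n_inward=0.30)  # 4140 W
tinted = window_solar_gain(10.0, 600.0, tau=0.10, alpha=0.30, n_inward=0.30)  # 1140 W
print(clear, tinted)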

2.3. Multi-Zone Thermal Coupling

When extending the model to multi-zone office buildings, it is essential to account for zone-specific solar radiation from ECWs, HVAC-induced temperature changes, and thermal coupling effects between adjacent zones. Moreover, detailed thermal transfer modeling is time-consuming and complicates zone-level temperature regulation. The coupling relationships among the zones are illustrated in Figure 1. The properties of the walls between zones and the ECWs can be examined in the Supplementary Information. To address these challenges, a MADRL model is employed to learn the complex thermal interactions among zones. Unlike single-agent DRL models, MADRL can adapt to inter-zone thermal coupling, balance local and global optimization objectives, and mitigate the dimensionality explosion problem.

3. Methodology

The proposed framework focuses on dynamically coordinated control between HVAC systems and ECWs, leveraging real-time monitoring of indoor/outdoor temperatures, solar radiation, and equipment operating status to adapt system operations based on environmental changes. When solar radiation intensity increases, ECW light transmittance is selectively decreased to reduce solar heat gain, thereby lowering the HVAC cooling load and associated energy use. Conversely, during periods of lower solar radiation, ECW light transmittance is suitably increased to maximize beneficial solar heat gains, reducing HVAC heating requirements. Thus, this section introduces the proposed multi-zone HVAC–ECW control framework in detail.

3.1. System Description

Consider an open-plan office building divided into n distinct zones. The indoor temperatures of these zones are regulated by the proposed multi-zone HVAC–ECW control framework. To better illustrate the system composition and interactions, Figure 2 shows the overall framework of the proposed multi-zone HVAC–ECW control system. The proposed multi-zone HVAC–ECW framework comprises zone-specific HVAC units and ECWs, collectively forming an integrated control architecture. Each HVAC system includes an air handling unit (AHU), which facilitates outdoor air intake and conditioning, and a variable air volume (VAV) unit for precise temperature regulation to ensure occupant comfort [43]. The temperature in each zone is controlled by adjusting HVAC setpoints and ECW light transmittance levels, thereby managing the operation of respective VAV units and ECWs.

3.2. Control Problem Formulation

The objective of the proposed multi-zone HVAC-ECW control framework is to simultaneously ensure thermal comfort in each zone and minimize HVAC energy consumption. The optimization objective is formulated as follows:
$$\min_{u_{\mathrm{HVAC}},\, u_{\mathrm{ECW}}} \sum_{t=1}^{T} \left[ \alpha \cdot E_{\mathrm{HVAC}}(t) + (1 - \alpha) \sum_{i=1}^{Z} \mathbb{I}\left( T_i(t) \notin [T_{\min}, T_{\max}] \right) \right]$$
subject to the thermal comfort constraint:
$$T_{\min} \le T_i(t) \le T_{\max}, \quad \forall i \in \{1, 2, \ldots, Z\}, \; \forall t \in \{1, 2, \ldots, T\}$$
where
  • $u_{\mathrm{HVAC}}$ and $u_{\mathrm{ECW}}$ represent the control action sets for the HVAC and ECW systems, respectively;
  • $E_{\mathrm{HVAC}}(t)$ is the HVAC system energy consumption at time step t;
  • $\alpha \in [0, 1]$ is a weighting factor that balances energy efficiency and thermal comfort;
  • $T_i(t)$ denotes the indoor temperature of zone i at time step t;
  • $T_{\min}$ and $T_{\max}$ are the predefined lower and upper bounds of the thermal comfort range;
  • $\mathbb{I}(\cdot)$ is an indicator function defined as
$$\mathbb{I}\left( T_i(t) \notin [T_{\min}, T_{\max}] \right) = \begin{cases} 1, & \text{if } T_i(t) < T_{\min} \text{ or } T_i(t) > T_{\max} \\ 0, & \text{otherwise} \end{cases}$$
To achieve coordinated control of the HVAC-ECW system, the problem is formalized as a Markov Decision Process (MDP), where each zone i is managed by a corresponding agent i. Each agent observes only the local environmental conditions relevant to its respective zone. The definitions of the global state and local observations for the agents are detailed in the subsequent section.
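As a concrete reading of the objective, the per-time-step cost can be evaluated as in the following sketch. The value of $\alpha$ and the fixed comfort band are illustrative; in this study the band depends on the hour of day:

def step_cost(e_hvac_kwh, zone_temps, alpha=0.5, t_min=22.0, t_max=24.0):
    """One summand of the objective: weighted energy plus comfort violations."""
    violations = sum(1 for temp in zone_temps if temp < t_min or temp > t_max)
    return alpha * e_hvac_kwh + (1.0 - alpha) * violations

# 3 kWh consumed while one of six zones drifts out of band -> 0.5*3 + 0.5*1 = 2.0
print(step_cost(3.0, [22.5, 23.1, 24.6, 23.0, 22.2, 23.8]))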

3.3. Global State and Local Observations

In the proposed method, the global state primarily facilitates coordination among agents during the training phase, whereas local observations directly guide the real-time decision-making of individual agents. During the testing phase, access to the global state is no longer available; thus, each agent must rely exclusively on its local observations to determine its actions.
The global state consists of 28 variables representing macro-level environmental conditions, the operational statuses of all HVAC-ECW subsystems, and other key real-time system-wide metrics. A detailed description and unit specification of the global state variables is presented in Table A1.
Each agent receives a set of eight local observation variables, which reflect the operational status of its corresponding HVAC-ECW unit and real-time environmental measurements specific to its zone. Table 1 provides the descriptions and units of the observations for zone i.
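The sketch below shows one plausible way to assemble the eight-variable local observation for zone i. The exact composition is given in Table 1; the field choices here are an assumption inferred from the per-zone entries of the global state in Table A1:

from dataclasses import dataclass, astuple

@dataclass
class ZoneObservation:
    solar_altitude: float    # SA, degrees
    solar_hour_angle: float  # SH, degrees
    outdoor_temp: float      # O, deg C
    work_flag: int           # W, 1 during working hours
    window_solar: float      # solar radiation on the zone's ECW, W/m2
    ecw_level: int           # current ECW transmittance level (0-5)
    zone_temp: float         # indoor temperature, deg C
    heating_setpoint: float  # current heating setpoint, deg C

# Flatten into the 8-dimensional feature vector fed to the agent's Q-network
obs = astuple(ZoneObservation(35.2, -15.0, 28.4, 1, 420.0, 3, 23.1, 22.0))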

3.4. HVAC-ECW System Actions

Each zone’s HVAC-ECW subsystem comprises an independent HVAC unit and an associated ECW, as detailed in Section 3.1. The control strategy modulates system operation by adjusting the HVAC thermostat setpoints and the light transmittance level of the ECW. The thermostat setpoint directly influences HVAC operation within each zone, whereas the ECW light transmittance modulates solar radiation gain through the glazing. At each decision step, the agent selects an integrated action consisting of an HVAC temperature setpoint and an ECW light transmittance level. The action for agent i at time t is formally represented as
$$\left[ A_{\mathrm{HVAC},i}(t),\, A_{\mathrm{ECW},i}(t) \right], \quad 1 \le i \le Z$$
$A_{\mathrm{HVAC},i}(t)$ denotes the logical control signal for the HVAC system in zone i at time t, and $A_{\mathrm{ECW},i}(t)$ represents the transmittance control level for the ECW. The HVAC activation is defined within a temperature range of [22 °C, 24 °C], while the deactivation threshold spans [5 °C, 50 °C]. A value of 0 for $A_{\mathrm{HVAC},i}(t)$ indicates that the HVAC is off, whereas a value of 1 indicates that it is active. The ECW transmittance is discretized into six levels, with $A_{\mathrm{ECW},i}(t)$ taking integer values from 0 to 5. A value of 0 corresponds to the ECW being fully deactivated, while values from 1 to 5 represent increasing levels of ECW light transmittance. The mapping between logical actions and their corresponding physical control signals is illustrated in Figure 3.
The action space for each individual agent remains fixed at 12 discrete options (2 HVAC states × 6 ECW levels), since control decisions are confined to the agent's own zone. Under the decentralized multi-agent formulation, the system-level action space therefore grows additively with the number of agents: a system of six agents exposes 72 agent-level options (6 × 12), rather than the $12^6$ joint combinations a centralized single-agent controller would face. This scalability is one of the key advantages of the multi-agent approach compared to centralized single-agent methods. The issue of dimensionality expansion is effectively mitigated due to the additive nature of the action space, allowing the system to scale efficiently even when additional agents are introduced. A minimal sketch of the action decoding is given below.
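The following sketch decodes one agent's discrete action index into physical control signals. The index encoding (hvac_on * 6 + ecw_level) is an assumption for illustration; the authoritative mapping is the one shown in Figure 3:

def decode_action(a):
    """Map a discrete action index (0-11) to HVAC setpoints and an ECW level."""
    hvac_on, ecw_level = divmod(a, 6)     # 2 HVAC states x 6 ECW levels = 12 actions
    if hvac_on:
        heating_sp, cooling_sp = 22.0, 24.0   # activation band [22, 24] deg C
    else:
        heating_sp, cooling_sp = 5.0, 50.0    # deactivation band [5, 50] deg C
    return heating_sp, cooling_sp, ecw_level

print(decode_action(9))   # -> (22.0, 24.0, 3): HVAC active, ECW level 3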

3.5. Reward Shaping

A well-designed reward function plays a critical role in the learning process of MADRL algorithms and significantly influences model performance after convergence. In this study, the reward function is structured into two components to enable dynamic coordination between HVAC and ECW systems, as described in Section 3.2, aiming to balance energy efficiency and thermal comfort. The reward function is defined as follows
$$\mathrm{reward} = (10 - \beta) \cdot R_T + \beta \cdot R_E$$
$$R_T = \sum_{i=1}^{Z} \begin{cases} \mathrm{Work} \cdot 1, & T_i \in [T_{\min}, T_{\max}] \\ -\,\mathrm{Work} \cdot T_{\mathrm{factor}} \cdot \left( T_i - \frac{T_{\min} + T_{\max}}{2} \right)^2, & T_i \notin [T_{\min}, T_{\max}] \end{cases}$$
$$R_E = -\,E_{\mathrm{HVAC}} \cdot E_{\mathrm{factor}}$$
At each time step, the computed reward is stored in the experience replay buffer to support the training of the MADRL framework. The parameter $\beta$ serves as a weighting factor that controls the trade-off between energy savings and thermal comfort priorities. $R_T$ represents the temperature-based reward at each time step, and $T_{\mathrm{factor}}$ scales the magnitude of the penalty when thermal comfort is violated. Work indicates whether the current time step falls within working hours, with a value of 1 during working periods and 0 otherwise. The thermal comfort range is defined as $[T_{\min}, T_{\max}]$, and $T_i$ denotes the indoor temperature of zone i. $R_E$ quantifies the energy consumption penalty, where $E_{\mathrm{HVAC}}$ is the energy consumed by the HVAC system at each time step and $E_{\mathrm{factor}}$ is the corresponding penalty coefficient. Because $R_T$ and $R_E$ differ in scale, $R_E$ is normalized via $E_{\mathrm{factor}}$ so that both terms remain in a comparable range, preventing energy penalties from dominating the reward and enabling balanced learning between energy savings and thermal comfort.
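A minimal implementation of this reward, under the sign conventions written above and with illustrative values for $T_{\mathrm{factor}}$ and $E_{\mathrm{factor}}$, could read:

def shaped_reward(zone_temps, e_hvac_kwh, work, beta=5.0,
                  t_min=22.0, t_max=24.0, t_factor=0.1, e_factor=0.01):
    """reward = (10 - beta) * R_T + beta * R_E; coefficient values are illustrative."""
    mid = 0.5 * (t_min + t_max)
    r_t = 0.0
    for temp in zone_temps:
        if t_min <= temp <= t_max:
            r_t += work * 1.0                            # comfort bonus
        else:
            r_t -= work * t_factor * (temp - mid) ** 2   # quadratic comfort penalty
    r_e = -e_hvac_kwh * e_factor                         # normalized energy penalty
    return (10.0 - beta) * r_t + beta * r_e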

3.6. QMIX Algorithm

This study employs the QMIX algorithm as the core of the proposed framework to address the collaborative optimization problem inherent in the multi-zone HVAC-ECW control framework. To the best of our knowledge, this is the first application of QMIX to building energy optimization. Its design accommodates continuous state–discrete action spaces through a value function factorization framework, which distinguishes it from conventional single-agent DRL methods. QMIX adopts a centralized training and decentralized execution (CTDE) paradigm, in which agent-specific action values are nonlinearly combined to produce a global Q-value that guides coordinated decision-making. A distinguishing feature of QMIX is the incorporation of a hypernetwork, which generates the mixing weights based on the global state. This allows the mixing network to flexibly model interactions between agent-specific Q-values and the global state. By decoupling individual agent policies and employing monotonic mixing constraints, QMIX effectively reduces the complexity of the joint action space from exponential to linear, enabling scalable learning in multi-zone scenarios. This method effectively mitigates the dimensionality explosion commonly observed in conventional multi-zone cooperative control schemes. The subsequent paragraphs provide a detailed description of the design and implementation of the QMIX-based training framework adopted in this study.
As an off-policy algorithm, QMIX consists of four main neural network components: local Q-networks for individual agents, corresponding target Q-networks, a mixing network, and a target mixing network. As illustrated in Figure 4, each agent is assigned its own local Q-network, which estimates Q-values based on local observations. These local Q-networks estimate action-values using zone-specific observations such as indoor temperature, ECW light transmittance levels, and HVAC operational status. The mixing network aggregates individual agent Q-values and combines them with the global state through a nonlinear mixing function to generate the joint Q-value. During the training process, experience tuples are collected in the following format:
$$\{ s_t^i,\, o_t^i,\, a_t^i,\, s_{t+1}^i,\, o_{t+1}^i \}, \quad \text{for } 1 \le i \le Z$$
In the equation, i denotes the agent associated with zone i, and t represents the current time step. At each time step, experience tuples are stored in a replay buffer as part of the QMIX experience replay mechanism. Once a sufficient number of experience tuples have been accumulated, mini-batches are randomly sampled from the replay buffer to update the neural network parameters. Each agent's local Q-network computes the current Q-value $Q(o_t^i, a_t^i)$ based on its local observation $o_t^i$. The target local Q-network estimates the subsequent Q-value $Q(o_{t+1}^i, a_{t+1}^i)$ using the observation from the next time step $o_{t+1}^i$. Subsequently, the mixing network and target mixing network respectively aggregate the agent-specific Q-values to obtain the current and target joint Q-values. The loss function is computed as the mean squared error (MSE) between the predicted joint Q-value $Q(s_t, a_t)$ and the target Q-value. Additionally, a cosine annealing learning rate schedule is adopted to improve training stability and facilitate convergence.
A soft update mechanism is periodically employed to synchronize network parameters, ensuring stable Q-value estimation and mitigating policy oscillation during training. After training completion, each agent independently selects actions based on the highest local Q-value, enabling decentralized real-time execution. The hyperparameters utilized for training the QMIX model are detailed in Table A2.
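To make the value-factorization step concrete, the following PyTorch sketch implements a monotonic mixing network of the kind described above, with dimensions taken from Table A2 (6 agents, a 28-dimensional global state, 64 mixing units, 128 hypernetwork units). It is a minimal sketch of the standard QMIX mixer, not the authors' exact implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class QMixer(nn.Module):
    """Combine per-agent Q-values into Q_tot with state-conditioned, non-negative weights."""
    def __init__(self, n_agents=6, state_dim=28, mix_hidden=64, hyper_hidden=128):
        super().__init__()
        self.n_agents, self.mix_hidden = n_agents, mix_hidden
        # Hypernetworks generate the mixing weights from the global state;
        # taking abs() of the weights enforces the monotonicity constraint.
        self.hyper_w1 = nn.Sequential(nn.Linear(state_dim, hyper_hidden), nn.ReLU(),
                                      nn.Linear(hyper_hidden, n_agents * mix_hidden))
        self.hyper_w2 = nn.Sequential(nn.Linear(state_dim, hyper_hidden), nn.ReLU(),
                                      nn.Linear(hyper_hidden, mix_hidden))
        self.hyper_b1 = nn.Linear(state_dim, mix_hidden)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, mix_hidden), nn.ReLU(),
                                      nn.Linear(mix_hidden, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.mix_hidden)
        b1 = self.hyper_b1(state).view(b, 1, self.mix_hidden)
        hidden = F.elu(torch.bmm(agent_qs.view(b, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.mix_hidden, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b, 1)   # Q_tot

mixer = QMixer()
print(mixer(torch.randn(32, 6), torch.randn(32, 28)).shape)  # torch.Size([32, 1])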

4. Case Study

The building simulation environment is constructed using OpenStudio, comprising six distinct thermal zones (as illustrated in Figure 5), each equipped with independent HVAC and ECWs [44]. EnergyPlus outputs a set of environmental variables at each simulation time step, which serve as input feature vectors for the QMIX-based MADRL framework. The control method of the multi-zone HVAC-ECW control framework is implemented using a Python-based control interface. The source code is publicly available at: https://github.com/Du0yu/EP-DL, accessed on 28 August 2025.
We set the time increment in the modeling simulation experiment to one hour. At each time step, EnergyPlus outputs a set of environmental variables (listed in Table 2) that serve as the input feature vector for the QMIX framework. The corresponding detailed data are available in the IDF file included in the Supplementary Information. These real-time observations form the basis for agent-level control decisions in each individual zone. The discrete control outputs are translated into physical actuation commands for HVAC and ECW components through the signal conversion mechanism illustrated in Figure 3. These control commands are transmitted via the EnergyPlus API back to the simulation environment, enabling closed-loop interaction between the agent and the building system.
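The closed loop can be sketched with the EnergyPlus Python API as below. Zone, schedule, and file names are placeholders; the authors' full control interface is available in the linked EP-DL repository:

from pyenergyplus.api import EnergyPlusAPI

api = EnergyPlusAPI()
state = api.state_manager.new_state()
api.exchange.request_variable(state, "Zone Mean Air Temperature", "Zone 1")
handles = {}

def control_callback(s):
    # Skip warm-up and wait until the data exchange is ready.
    if api.exchange.warmup_flag(s) or not api.exchange.api_data_fully_ready(s):
        return
    if not handles:
        handles["t1"] = api.exchange.get_variable_handle(
            s, "Zone Mean Air Temperature", "Zone 1")
        handles["sp1"] = api.exchange.get_actuator_handle(
            s, "Schedule:Compact", "Schedule Value", "Zone 1 Heating Setpoint")
    t_zone1 = api.exchange.get_variable_value(s, handles["t1"])
    # ...assemble local observations, query the trained agents, decode actions...
    api.exchange.set_actuator_value(s, handles["sp1"], 22.0)  # example command

api.runtime.callback_begin_zone_timestep_after_init_heat_balance(state, control_callback)
api.runtime.run_energyplus(state, ["-w", "weather.epw", "-d", "out", "model.idf"])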
To enhance the model’s generalization capability across diverse climatic conditions, a full-year weather dataset (8760 h) is used for simulation training. The weather data “CHN_SH_Lang.Gang.Island.584760_TMYx.epw” was used as the representative climate profile for the simulation, obtained from the EnergyPlus weather database [45].
As detailed in Section 3.6, each training episode consists of 8760 time steps, corresponding to the full annual simulation period defined by the weather dataset. Training begins once the experience replay buffer reaches a minimum threshold of 256 transitions. A sliding window sampling strategy is adopted, in which mini-batches of 128 transitions are randomly drawn from the replay buffer at each training step to update the network parameters. Target network parameters are synchronized with the main networks every 10 time steps following gradient descent updates. The complete training process consists of 100 episodes, with each episode simulating 8760 time steps—equivalent to one full year of operation. After convergence, four independent evaluation runs are performed to assess the robustness and control performance of the trained model.
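Putting the schedule together, the training loop described above can be summarized as follows. The helper functions are stubs standing in for the EnergyPlus environment and the QMIX learner, whose real implementations live in the EP-DL repository:

import random
from collections import deque

def collect_transition(t): return {"t": t}   # stub: one (s, o, a, r, s', o') tuple
def qmix_update(batch): return 0.0           # stub: TD update through the mixer
def sync_target_networks(): pass             # stub: periodic target synchronization

buffer = deque(maxlen=1000)                  # buffer_size (Table A2)
MIN_SIZE, BATCH, TARGET_CYCLE = 256, 128, 10

grad_steps = 0
for episode in range(100):                   # 100 episodes, each one simulated year
    for t in range(8760):                    # hourly control steps
        buffer.append(collect_transition(t))
        if len(buffer) >= MIN_SIZE:
            batch = random.sample(buffer, BATCH)   # random mini-batch from replay
            qmix_update(batch)
            grad_steps += 1
            if grad_steps % TARGET_CYCLE == 0:
                sync_target_networks()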

5. Results

5.1. Overview

This section presents a comprehensive evaluation of the proposed multi-zone HVAC–ECW control framework across multiple dimensions, including training efficiency, energy savings, and thermal comfort performance. A comparative analysis with several state-of-the-art MADRL algorithms is also provided. Additionally, the effects of different energy–comfort trade-off coefficients on energy consumption and thermal comfort are investigated. Finally, the framework's robustness is assessed under various climatic conditions to verify its generalizability. Apart from the detailed analysis of the relationship between β and performance in Section 5.4, all main results of this study are obtained with β = 5.

5.2. Baseline Experiment

Two HVAC control strategies are compared: an RBC method serving as the reference baseline, and a DQN method serving as a stronger learning-based benchmark. All energy savings are calculated relative to the RBC baseline. For reproducibility, the complete IDF file is provided in the Supplementary Information, and Appendix A.3 details the RBC logic.
Table 3 summarizes the performance comparison between RBC, DQN, and QMIX in terms of energy consumption, TCVR, and temperature deviation. Compared to the RBC control strategy, the application of the DQN method for HVAC system control results in a 20.16% reduction in energy consumption, with a TCVR of merely 0.60% and a temperature deviation maintained within 0.59 °C.

5.3. Evaluation of QMIX Algorithm Performance

Figure 6 shows the reward convergence trajectory of QMIX across multiple training iterations. The plotted reward is the total return per episode, with higher values denoting stronger energy-saving performance and superior thermal comfort. The model converges after approximately 4 h, a longer training duration than the single-agent DQN model. This extended training time is attributed to QMIX's multi-agent joint optimization mechanism, which simultaneously optimizes control actions across six zones and integrates value functions within the joint action space to enable coordinated temperature regulation. To ensure optimal performance, the QMIX model was trained over 100 episodes, each representing a full-year simulation cycle. We then compare QMIX against the baseline and the DQN benchmark. Table 3 indicates that QMIX achieves an average energy saving of 40.28% relative to the RBC baseline during evaluation. It also substantially reduces the TCVR to 0.12% and limits the average temperature deviation to 0.35 °C. These results indicate that QMIX substantially improves energy efficiency while maintaining high thermal comfort, highlighting its overall effectiveness.
Figure 7 compares the energy consumption of QMIX and the baseline over a representative period. The baseline HVAC energy consumption curve shows a more complex pattern, typically resulting from frequent short-term start–stop cycles, and the baseline system operated for extended stretches. Under the proposed framework, by contrast, HVAC energy consumption primarily exhibits a single peak: the system runs at higher power for a short period before settling into a low-power steady state, avoiding both prolonged high-power operation and repeated start–stop cycling. Table 3 shows that the QMIX framework improves energy efficiency by 20% compared with the DQN method, which regulates only the HVAC system. In addition, QMIX's more precise regulation strategy reduced thermal comfort violations, which aligns with the theory described in Section 2.
However, Figure 7 also shows that the energy consumption curve of the QMIX strategy surges, with its instantaneous peak power exceeding that of the baseline experiment. The QMIX method places the HVAC system in a short-term high-load state to rapidly remove accumulated heat, then maintains temperature stability by dynamically dimming the ECW (for example, a sudden reduction of the SHGC from 0.6 to 0.3) and exploiting the building's thermal inertia, thereby obtaining higher training rewards. However, our study does not incorporate additional optimization for HVAC load peaks, and the HVAC system modeling lacks sufficient refinement. As a result, the reward design assigns excessive priority to optimization along the time dimension and overlooks the potential value of redistributing load in space. Consequently, high-load HVAC peaks remain an issue to be addressed in future work.

5.4. Energy Efficiency–Comfort Balance Test

This section examines the impact of the energy–comfort balance coefficient ( β ) on the overall performance of the control framework. Figure 8 illustrates how varying β values influence model performance in terms of both energy savings and TCVR. As β increases from 0.1 to 9.9, the TCVR progressively rises in a nonlinear manner, with an inflection point observed around β = 5. Beyond β = 5, the violation rate exhibits approximately exponential growth, while energy efficiency improves, reaching a 15.2% increase at β = 9.9. This behavior results from the interaction between HVAC system operating frequency and comfort parameters: A higher β tolerates greater temperature deviations, thereby reducing HVAC activation frequency in exchange for increased reward. As β increases from 5 to 9.9, energy savings improve by approximately 15%; however, the TCVR increases sharply from below 1% to over 10%. Therefore, selecting an appropriate β value is essential to balance energy savings and thermal comfort according to specific operational requirements.

5.5. Comparative Analysis of Contemporary MADRL Algorithms

This section presents a comparative performance analysis of the proposed framework against several state-of-the-art MADRL algorithms. These MADRL algorithms are chosen for their representative performance in the recent literature, evaluated under a standardized experimental benchmark for consistency and fairness. All models share identical state, observation, and reward configurations as defined in Section 3, along with the same simulation settings detailed in Section 4, to ensure experimental integrity. Network architectures and hyperparameters are aligned across all algorithms and kept consistent with those used in QMIX.
Figure 9 displays the reward convergence curves for all MADRL algorithms under identical simulation conditions. The VDN algorithm demonstrates relatively fast convergence and stable intermediate performance. Although QMIX converges more slowly, it ultimately achieves the highest cumulative reward among all compared methods. Table 4 summarizes the post-convergence performance metrics, showing that the proposed framework outperforms the others in key indicators, including energy savings and thermal comfort. The framework achieves a TCVR of approximately 0.12%, lower than that of competing algorithms, and delivers up to 40% improvement in energy efficiency. Although the temperature deviation differs by 0.01 °C between the proposed method and MADQN, this minimal difference is negligible in practical applications.

5.6. Verification of Generalization and Robustness

This section evaluates the generalization and robustness of the proposed framework by assessing energy efficiency and TCVR under weather conditions from various climatic regions. The results are summarized in Table 5. In Table 5, “Temperature Offset” refers to the deviation from the comfort temperature range and “Energy Efficiency” represents the percentage reduction in HVAC energy consumption. The proposed framework maintains optimal performance in key regions such as Singapore, Hong Kong, and Guangzhou, demonstrating strong adaptability to different climatic conditions.
Regions such as Beijing and Chengdu are characterized by pronounced seasonal temperature fluctuations, resulting in substantial variability in outdoor thermal conditions and consequently high HVAC energy demand with considerable potential for optimization. In these regions, the proposed framework achieves energy savings of up to 48%, highlighting its effectiveness in variable climate conditions. By contrast, in tropical and subtropical cities such as Singapore and Hong Kong, where outdoor temperatures remain relatively stable and high throughout the year, HVAC demand is consistently high but the optimization potential is more limited, yielding achievable energy savings of approximately 16–24%.
As depicted in Figure 10, regions experiencing sustained thermal discomfort exhibit higher prospective HVAC energy savings. Conversely, in regions such as Singapore, where ambient temperatures remain high and stable for most of the year and humidity is elevated, the HVAC system must operate almost continuously, which constrains the optimization potential. These findings underscore the importance of considering regional climatic conditions when applying energy optimization strategies in buildings.
The experimental results presented in this section validate the efficacy of the proposed method in reducing energy consumption across diverse climatic conditions, while also demonstrating its robust performance. These findings indicate that the framework is broadly applicable to various weather regions.

5.7. Impact of ECW Switching Delay on Framework Performance

Because the minimum time step of the simulation platform cannot be reduced to the second level, this study did not explicitly model the switching delay of ECWs nor assess its impact experimentally. Instead, we indirectly estimated the potential impact of switching delay by examining whether shortening the control interval led to substantial performance improvements. As shown in Table 6, reducing the simulation control interval from one hour to the minute level resulted in only marginal improvements. This finding indirectly suggests that even if minute-level control were combined with near-instantaneous ECW switching, the overall performance gains would remain limited.
In addition, previous studies have reported that the switching time of advanced ECWs typically falls within 5–10 s, depending on material composition and device configuration [46,47]. For instance, an electrochromic device based on ammonium metatungstate–iron(II) chloride liquid exhibited a response time of approximately 8.5 s [46], which was further reduced to 6.5 s through machine learning–guided optimization [47]. Based on these findings, we expect that ECW switching delays are relatively minor and unlikely to significantly affect overall system performance. However, we acknowledge that the present study does not fully capture the hardware-level dynamics of ECWs, and in practical applications, switching delays may still represent an important factor influencing control performance.

6. Discussion

This study investigates the integration of DRL techniques for HVAC control and explores the emerging role of ECWs in improving energy efficiency. While previous studies have demonstrated the individual effectiveness of DRL-based HVAC control and ECWs in reducing building energy consumption, they often overlooked the potential benefits of their coordinated control. Furthermore, the dimensionality explosion associated with single-agent DRL methods remains an unresolved challenge.
Our findings indicate that MADRL is capable of explicitly modeling inter-zone thermal coupling and mitigating the dimensionality explosion problem inherent in single-agent RL approaches such as DQN. Additionally, the coordinated control of HVAC systems and building envelope components, such as windows and smart windows, further enhances overall energy performance.
This study presents a novel MADRL-based framework that enables the co-optimization of HVAC systems and ECWs in smart buildings. Experimental results demonstrate that the proposed framework achieves superior energy efficiency compared to configurations excluding ECW integration. Moreover, QMIX effectively alleviates the growth of the state–action space induced by an increasing number of zones, thereby improving scalability. For instance, controlling 10 zones using a DQN-based model would result in an action space of $12^{10}$, which severely hinders convergence. In contrast, QMIX reduces the effective action space to 120. Although the convergence time increases compared to scenarios with six zones, it remains within a practical and acceptable range.
Although Xue et al. [8] employed MADDPG to address the dimensionality explosion problem in single-agent RL, their training process was considerably time-consuming. Notably, their training duration was one month, whereas our framework was trained over a full annual cycle. Furthermore, their study controlled only three zones, compared to six zones in our framework, suggesting that QMIX offers enhanced scalability over MADDPG. In addition, this study builds upon Chen et al.’s research [12] by replacing conventional windows with ECWs, thereby enhancing energy efficiency and offering a novel solution for buildings with fixed glazing.
Despite the promising simulation results, several challenges remain in translating the proposed framework from theoretical simulation to real-world implementation:
1. Stochastic occupant behavior. In real-world settings, occupant behavior is significantly less predictable than the patterns assumed in EnergyPlus simulations. The randomness associated with occupant actions may reduce both the energy-saving potential of the proposed framework and its capacity to maintain thermal comfort. The influence of such behavioral variability on thermal loads and control effectiveness warrants further investigation in future research.
2. Lack of real-world validation. The current framework has only been validated through simulation, without consideration of hardware constraints or deployment challenges. Although deploying pre-trained models is technically feasible, discrepancies between simulated and real-world environments may adversely affect control performance. Moreover, the present work does not include hardware-in-the-loop validation, and the control implications of ECW hysteresis or partial-state transitions are not explicitly modeled. Future research will incorporate real device characteristics into the control framework to more comprehensively assess the effect of ECW switching dynamics.
3. Absence of lighting control. In many office scenarios, lighting is typically assumed to remain continuously on, and we adopt this assumption in our experiments to simplify the research problem. As a result, considerations of visual comfort related to daylight and glare were not included. Future work could extend the proposed framework by incorporating lighting control to more comprehensively address occupant comfort.
4. High-load HVAC peaks. The current framework does not penalize the load ramping behavior observed in Figure 7. Future work could incorporate more comprehensive physical constraints into the algorithm to improve its representation of real operating conditions, thereby facilitating further progress toward real-world deployment.

7. Conclusions

This study proposes a multi-zone HVAC–ECW control framework based on MADRL to enhance energy efficiency while maintaining thermal comfort, addressing the research gap in coordinated control between HVAC systems and building components. This work also aligns with the emerging research trend of applying reinforcement learning algorithms to integrated energy systems. To the best of our knowledge, this is the first study to integrate ECWs into a building energy management system through MADRL-based control. The framework leverages the QMIX algorithm to explicitly model thermal coupling among individual zones, thereby enabling cooperative optimization across multiple zones and alleviating the dimensionality explosion problem. EnergyPlus simulations validated the effectiveness of the proposed framework in improving energy efficiency while maintaining indoor thermal comfort. Compared to the baseline experiment, the energy consumption of the HVAC system can be reduced by 40%.
It also performs comparably to other state-of-the-art MADRL algorithms and maintains strong performance across diverse climatic regions, supporting its generalizability. Moreover, our analysis suggests that the switching delay of ECWs is generally small; relative to the control time step adopted in this study, its impact is expected to be negligible. ECW switching delays are therefore unlikely to constitute a major barrier to real-world deployment, which provides preliminary support for the practical feasibility of the proposed framework. However, certain simplifications were made in this study, creating a gap between the proposed model and practical deployment. The framework has notable limitations: it does not account for the impact of stochastic occupant behavior, lacks validation through real-world deployment, and does not incorporate lighting and ventilation into the control scope. Future research may focus on three critical factors that strongly influence deployment feasibility and occupant satisfaction: stochastic occupant behavior, dynamic comfort ranges, and load ramping mitigation for HVAC systems. These directions will further bridge the gap between simulation-based studies and the real-world deployment of RL-based building control frameworks.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/buildings15173114/s1.

Author Contributions

Conceptualization, H.C. and D.S.; Methodology, H.C.; Software, H.C. and D.S.; Validation, H.C., D.S. and H.Y.; Formal Analysis, H.C. and D.S.; Investigation, H.C.; Resources, Y.Z. and H.C.; Data Curation, H.C. and Y.S.; Writing—Original Draft, H.C.; Writing—Review and Editing, H.Y., D.S., Y.S., Y.Z. and H.C.; Visualization, D.S. and Y.S.; Supervision, H.C. and D.S.; Project Administration, H.C. and D.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Code is available at: https://github.com/Du0yu/EP-DL, accessed on 28 August 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. State Table

Table A1. State variables definition.

Variable | Definition | Unit
SA | Solar Altitude Angle | Degrees
SH | Solar Hour Angle | Degrees
O | Outdoor Temperature | °C
W | Work Time Flag | Binary (0 or 1)
solar_Win1 | Zone 1 Window Solar Radiation | W/m²
solar_Win2 | Zone 2 Window Solar Radiation | W/m²
solar_Win3 | Zone 3 Window Solar Radiation | W/m²
solar_Win4 | Zone 4 Window Solar Radiation | W/m²
solar_Win5 | Zone 5 Window Solar Radiation | W/m²
solar_Win6 | Zone 6 Window Solar Radiation | W/m²
Win_1 | Zone 1 Window Status | Status
Win_2 | Zone 2 Window Status | Status
Win_3 | Zone 3 Window Status | Status
Win_4 | Zone 4 Window Status | Status
Win_5 | Zone 5 Window Status | Status
Win_6 | Zone 6 Window Status | Status
T_1 | Zone 1 Temperature | °C
T_2 | Zone 2 Temperature | °C
T_3 | Zone 3 Temperature | °C
T_4 | Zone 4 Temperature | °C
T_5 | Zone 5 Temperature | °C
T_6 | Zone 6 Temperature | °C
H_1 | Zone 1 Heating Temperature | °C
H_2 | Zone 2 Heating Temperature | °C
H_3 | Zone 3 Heating Temperature | °C
H_4 | Zone 4 Heating Temperature | °C
H_5 | Zone 5 Heating Temperature | °C
H_6 | Zone 6 Heating Temperature | °C

Appendix A.2. QMIX’s Hyperparameter

Table A2. Model parameters.

| Category | Parameter | Value |
|---|---|---|
| Environment | cuda | True |
| | alg | qmix |
| Training | gamma | 0.95 |
| | evaluate_epoch | 4 |
| | load_model | True |
| | evaluate | False |
| | total_episodes | 100 |
| Task Settings | obs_shape | 8 |
| | n_actions | 12 |
| | n_agents | 6 |
| | state_shape | 28 |
| Network Architecture | mlp_hidden_dim | 128 |
| | qmix_hidden_dim | 64 |
| | two_hyper_layers | True |
| | hyper_hidden_dim | 128 |
| Optimization | lr | 1 × 10⁻³ |
| | lr_cosine_annealing | True |
| Experience Replay | minimal_size | 256 |
| | batch_size | 128 |
| | buffer_size | 1000 |
| Temperature Parameters | temp_start | 1 |
| | min_temp | 0.01 |
| | temp_anneal_steps | 300,000 |
| Training Control | target_update_cycle | 10 |
| | log_interval | 5000 |
| | save_cycle | 8000 |
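The network sizes in Table A2 largely pin down the mixing network. The PyTorch sketch below is not the authors' code (see the linked repository for that); it is a minimal QMIX mixer consistent with n_agents = 6, state_shape = 28, qmix_hidden_dim = 64, and a two-layer hypernetwork with hyper_hidden_dim = 128.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QMixer(nn.Module):
    """Monotonic QMIX mixing network sized with the Table A2 values."""

    def __init__(self, n_agents: int = 6, state_shape: int = 28,
                 mixing_dim: int = 64, hyper_hidden: int = 128):
        super().__init__()
        self.n_agents = n_agents
        self.mixing_dim = mixing_dim
        # Two-layer hypernetworks (two_hyper_layers = True) produce the
        # mixing weights from the 28-dimensional global state.
        self.hyper_w1 = nn.Sequential(
            nn.Linear(state_shape, hyper_hidden), nn.ReLU(),
            nn.Linear(hyper_hidden, n_agents * mixing_dim))
        self.hyper_w2 = nn.Sequential(
            nn.Linear(state_shape, hyper_hidden), nn.ReLU(),
            nn.Linear(hyper_hidden, mixing_dim))
        self.hyper_b1 = nn.Linear(state_shape, mixing_dim)
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_shape, mixing_dim), nn.ReLU(),
            nn.Linear(mixing_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents) per-zone Q-values; state: (batch, state_shape).
        b = agent_qs.size(0)
        # abs() keeps mixing weights non-negative, enforcing monotonicity
        # of Q_tot in each agent's Q-value.
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.mixing_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.mixing_dim)
        hidden = F.elu(torch.bmm(agent_qs.view(b, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.mixing_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b, 1)  # Q_tot: (batch, 1)
```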

Appendix A.3. RBC Baseline Specification

As shown in Algorithm A1, the RBC strategy controls only the HVAC system; no control is applied to the ECWs. The temperature setpoint band is 22–24 °C from 6:00 a.m. to 7:00 p.m. on weekdays and 20–26 °C during the remaining hours, and the HVAC system remains off on weekends. The IDF file in the Supplementary Information contains the detailed building modeling parameters, occupant schedules, and all HVAC system settings, so the RBC strategy can be implemented directly on the EnergyPlus simulation platform without additional coding.
Algorithm A1 Rule-Based Control (RBC) Strategy

 1: procedure RBC_Control(t, zones)
 2:   day ← Weekday(t)                      ▹ 0 = Mon, …, 6 = Sun
 3:   hour ← Hour(t)
 4:   if day ∈ {5, 6} then                  ▹ Weekend (Sat or Sun)
 5:     HVAC_ON ← False
 6:     for all zone ∈ zones do
 7:       Send_HVAC_Enable(zone, HVAC_ON)
 8:     end for
 9:   else                                  ▹ Weekday (Mon–Fri)
10:     HVAC_ON ← True
11:     if 6 ≤ hour < 19 then               ▹ Work hours
12:       Heating_SetTemp ← 22 °C
13:       Cooling_SetTemp ← 24 °C
14:     else
15:       Heating_SetTemp ← 20 °C
16:       Cooling_SetTemp ← 26 °C
17:     end if
18:     for all zone ∈ zones do
19:       Send_HVAC_Enable(zone, HVAC_ON)
20:       Send_Heating_SetTemp(zone, Heating_SetTemp)
21:       Send_Cooling_SetTemp(zone, Cooling_SetTemp)
22:     end for
23:   end if
24: end procedure
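For readers who prefer an executable form, the following Python sketch reproduces the schedule of Algorithm A1. The three setter callbacks are hypothetical placeholders for the corresponding EnergyPlus actuator writes; in practice the same schedule is expressed directly in the IDF file, as noted above.

```python
from datetime import datetime

WORK_HOURS = range(6, 19)  # 6 ≤ hour < 19, per Algorithm A1

def rbc_control(t: datetime, zones, send_enable, send_heat_sp, send_cool_sp):
    """Rule-based baseline of Algorithm A1.

    send_enable / send_heat_sp / send_cool_sp are hypothetical callbacks
    standing in for the EnergyPlus actuator writes.
    """
    if t.weekday() >= 5:                  # Saturday or Sunday
        for zone in zones:
            send_enable(zone, False)      # HVAC off on weekends
        return
    if t.hour in WORK_HOURS:              # weekday work hours
        heat_sp, cool_sp = 22.0, 24.0
    else:                                 # weekday off-hours setback
        heat_sp, cool_sp = 20.0, 26.0
    for zone in zones:
        send_enable(zone, True)
        send_heat_sp(zone, heat_sp)
        send_cool_sp(zone, cool_sp)
```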

References

1. Hamilton, I.; Kennard, H.; Rapf, O.; Kockat, J.; Zuhaib, S.; Global Alliance for Buildings and Construction. 2020 Global Status Report for Buildings and Construction: Towards a Zero-Emissions, Efficient and Resilient Buildings and Construction Sector. 2020. Available online: https://globalabc.org/sites/default/files/inline-files/2020%20Buildings%20GSR_FULL%20REPORT.pdf (accessed on 30 May 2025).
2. Liu, X.; Gou, Z. Occupant-centric HVAC and window control: A reinforcement learning model for enhancing indoor thermal comfort and energy efficiency. Build. Environ. 2024, 250, 111197.
3. Gao, Y.; Miyata, S.; Akashi, Y. Energy saving and indoor temperature control for an office building using tube-based robust model predictive control. Appl. Energy 2023, 341, 121106.
4. Cao, X.; Wang, K.; Xia, L.; Yu, W.; Li, B.; Yao, J.; Yao, R. A three-stage decision-making process for cost-effective passive solutions in office buildings in the hot summer and cold winter zone in China. Energy Build. 2022, 268, 112173.
5. Azuatalam, D.; Lee, W.L.; De Nijs, F.; Liebman, A. Reinforcement learning for whole-building HVAC control and demand response. Energy AI 2020, 2, 100020.
6. Kurte, K.; Munk, J.; Kotevska, O.; Amasyali, K.; Smith, R.; McKee, E.; Du, Y.; Cui, B.; Kuruganti, T.; Zandi, H. Evaluating the adaptability of reinforcement learning based HVAC control for residential houses. Sustainability 2020, 12, 7727.
7. Zhong, X.; Zhang, Z.; Zhang, R.; Zhang, C. End-to-end deep reinforcement learning control for HVAC systems in office buildings. Designs 2022, 6, 52.
8. Xue, W.; Jia, N.; Zhao, M. Multi-agent deep reinforcement learning based HVAC control for multi-zone buildings considering zone-energy-allocation optimization. Energy Build. 2025, 329, 115241.
9. Ganesh, H.S.; Seo, K.; Fritz, H.E.; Edgar, T.F.; Novoselac, A.; Baldea, M. Indoor air quality and energy management in buildings using combined moving horizon estimation and model predictive control. J. Build. Eng. 2021, 33, 101552.
10. Wang, D.; Zheng, W.; Wang, Z.; Wang, Y.; Pang, X.; Wang, W. Comparison of reinforcement learning and model predictive control for building energy system optimization. Appl. Therm. Eng. 2023, 228, 120430.
11. Peng, Y.; Lei, Y.; Tekler, Z.D.; Antanuri, N.; Lau, S.K.; Chong, A. Hybrid system controls of natural ventilation and HVAC in mixed-mode buildings: A comprehensive review. Energy Build. 2022, 276, 112509.
12. Chen, Y.; Norford, L.K.; Samuelson, H.W.; Malkawi, A. Optimal control of HVAC and window systems for natural ventilation through reinforcement learning. Energy Build. 2018, 169, 195–205.
13. Ding, X.; Du, W.; Cerpa, A. OCTOPUS: Deep reinforcement learning for holistic smart building control. In Proceedings of the 6th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation, New York, NY, USA, 13–14 November 2019; pp. 326–335.
14. Lee, H.; Oh, M.; Seo, J.; Kim, W. Seismic and energy performance evaluation of large-scale curtain walls subjected to displacement control fasteners. Appl. Sci. 2021, 11, 6725.
15. Hwang, R.L.; Chen, W.A. Creating glazed facades performance map based on energy and thermal comfort perspective for office building design strategies in Asian hot-humid climate zone. Appl. Energy 2022, 311, 118689.
16. Xu, Y.; Yan, C.; Yan, S.; Liu, H.; Pan, Y.; Zhu, F.; Jiang, Y. A multi-objective optimization method based on an adaptive meta-model for classroom design with smart electrochromic windows. Energy 2022, 243, 122777.
17. Macrelli, G. Electrochromic windows. In Proceedings of the Renewables: The Energy for the 21st Century World Renewable Energy Congress VI, Brighton, UK, 1–7 July 2000; Elsevier: Amsterdam, The Netherlands, 2000; pp. 177–183.
18. Wu, Y.; Kong, S.; Yao, Q.; Li, M.; Lai, H.; Sun, D.; Cai, Q.; Qiu, Z.; Ning, H.; Zhang, Y. Machine learning-guided cycle life prediction for electrochromic devices based on deuterium and water mixing solvent. Micromachines 2024, 15, 1073.
19. Al Sayed, K.; Boodi, A.; Broujeny, R.S.; Beddiar, K. Reinforcement learning for HVAC control in intelligent buildings: A technical and conceptual review. J. Build. Eng. 2024, 95, 110085.
20. Jiang, Z.; Risbeck, M.J.; Ramamurti, V.; Murugesan, S.; Amores, J.; Zhang, C.; Lee, Y.M.; Drees, K.H. Building HVAC control with reinforcement learning for reduction of energy cost and demand charge. Energy Build. 2021, 239, 110833.
21. Kodama, N.; Harada, T.; Miyazaki, K. Home energy management algorithm based on deep reinforcement learning using multistep prediction. IEEE Access 2021, 9, 153108–153115.
22. Bereketeab, L.; Zekeria, A.; Aloqaily, M.; Guizani, M.; Debbah, M. Energy optimization in sustainable smart environments with machine learning and advanced communications. IEEE Sens. J. 2024, 24, 5704–5712.
23. Hu, Z.; Gao, Y.; Sun, L.; Mae, M.; Imaizumi, T. A novel reinforcement learning method based on generative adversarial network for air conditioning and energy system control in residential buildings. Energy Build. 2025, 336, 115564.
24. Deng, X.; Zhang, Y.; Zhang, Y.; Qi, H. Towards optimal HVAC control in non-stationary building environments combining active change detection and deep reinforcement learning. Build. Environ. 2022, 211, 108680.
25. Wang, H.; Chen, X.; Vital, N.; Duffy, E.; Razi, A. Energy optimization for HVAC systems in multi-VAV open offices: A deep reinforcement learning approach. Appl. Energy 2024, 356, 122354.
26. Blad, C.; Bøgh, S.; Kallesøe, C.S. Data-driven offline reinforcement learning for HVAC-systems. Energy 2022, 261, 125290.
27. Zhang, Z.; Chong, A.; Pan, Y.; Zhang, C.; Lam, K.P. Whole building energy model for HVAC optimal control: A practical framework based on deep reinforcement learning. Energy Build. 2019, 199, 472–490.
28. Li, H.; Wan, Z.; He, H. Real-time residential demand response. IEEE Trans. Smart Grid 2020, 11, 4144–4154.
29. Wang, M.; Lin, B. MF^2: Model-free reinforcement learning for modeling-free building HVAC control with data-driven environment construction in a residential building. Build. Environ. 2023, 244, 110816.
30. Gao, Y.; Shi, S.; Miyata, S.; Akashi, Y. Successful application of predictive information in deep reinforcement learning control: A case study based on an office building HVAC system. Energy 2024, 291, 130344.
31. Li, F.; Du, Y. Intelligent multi-zone residential HVAC control strategy based on deep reinforcement learning. Appl. Energy 2021, 281, 116117.
32. Liu, X.; Wu, Y.; Wu, H. Enhancing HVAC energy management through multi-zone occupant-centric approach: A multi-agent deep reinforcement learning solution. Energy Build. 2024, 303, 113770.
33. Yu, L.; Sun, Y.; Xu, Z.; Shen, C.; Yue, D.; Jiang, T.; Guan, X. Multi-agent deep reinforcement learning for HVAC control in commercial buildings. IEEE Trans. Smart Grid 2020, 12, 407–419.
34. Shen, R.; Zhong, S.; Zheng, R.; Yang, D.; Xu, B.; Li, Y.; Zhao, J. Advanced control framework of regenerative electric heating with renewable energy based on multi-agent cooperation. Energy Build. 2023, 281, 112779.
35. Li, J.; Zhang, W.; Gao, G.; Wen, Y.; Jin, G.; Christopoulos, G. Toward intelligent multizone thermal control with multiagent deep reinforcement learning. IEEE Internet Things J. 2021, 8, 11150–11162.
36. Li, Y.; Li, L.; Cui, X.; Shen, P. Coupled building simulation and CFD for real-time window and HVAC control in sports space. J. Build. Eng. 2024, 97, 110731.
37. Piccolo, A.; Simone, F. Performance requirements for electrochromic smart window. J. Build. Eng. 2015, 3, 94–103.
38. Sadooghi, P. HVAC electricity and natural gas saving potential of a novel switchable window compared to conventional glazing systems: A Canadian house case study in city of Toronto. Sol. Energy 2022, 231, 129–139.
39. Reynisson, H. Energy Performance of Dynamic Windows in Different Climates. Master's Thesis, KTH Royal Institute of Technology, Stockholm, Sweden, 2015.
40. Oh, M.; Tae, S.; Hwang, S. Analysis of heating and cooling loads of electrochromic glazing in high-rise residential buildings in South Korea. Sustainability 2018, 10, 1121.
41. Dussault, J.M.; Gosselin, L. Office buildings with electrochromic windows: A sensitivity analysis of design parameters on energy performance, and thermal and visual comfort. Energy Build. 2017, 153, 50–62.
42. Crawley, D.B.; Lawrie, L.K.; Winkelmann, F.C.; Buhl, W.F.; Huang, Y.J.; Pedersen, C.O.; Strand, R.K.; Liesen, R.J.; Fisher, D.E.; Witte, M.J. EnergyPlus: Creating a new-generation building energy simulation program. Energy Build. 2001, 33, 319–331.
43. Okochi, G.S.; Yao, Y. A review of recent developments and technological advancements of variable-air-volume (VAV) air-conditioning systems. Renew. Sustain. Energy Rev. 2016, 59, 784–817.
44. Guglielmetti, R.; Macumber, D.; Long, N. OpenStudio: An open source integrated analysis platform. In Proceedings of the Building Simulation 2011: 12th Conference of International Building Performance Simulation Association, Sydney, Australia, 14–16 November 2011.
45. Fumo, N.; Mago, P.; Luck, R. Methodology to estimate building energy consumption using EnergyPlus Benchmark Models. Energy Build. 2010, 42, 2331–2337.
46. Kong, S.; Zhang, G.; Li, M.; Yao, R.; Guo, C.; Ning, H.; Zhang, J.; Tao, R.; Yan, H.; Lu, X. Investigation of an electrochromic device based on ammonium metatungstate-iron (II) chloride electrochromic liquid. Micromachines 2022, 13, 1345.
47. Kong, S.; Li, M.; Xiang, Y.; Wu, Y.; Fan, Z.; Yang, H.; Cai, Q.; Zhang, M.; Zhang, Y.; Ning, H. Machine learning-guided investigation for a high-performance electrochromic device based on ammonium metatungstate-iron (II) chloride-heavy water electrochromic liquid. J. Mater. Chem. C 2023, 11, 12776–12784.
Figure 1. Zones' coupling relationships.
Figure 2. System framework.
Figure 3. Actions transfer.
Figure 4. QMIX-based framework.
Figure 5. Building model with six zones.
Figure 6. Reward convergence of QMIX during training.
Figure 7. The comparison of energy consumption between QMIX and the baseline.
Figure 8. Trade-off between energy efficiency and thermal comfort.
Figure 9. Reward convergence during RL training.
Figure 10. Weather conditions in 6 different places.
Table 1. The observations of zone i.

| Variable | Definition | Unit |
|---|---|---|
| SA | Solar Altitude Angle | Degrees |
| SH | Solar Hour Angle | Degrees |
| O | Outdoor Temperature | °C |
| W | Work Time Flag | Binary (0 or 1) |
| solar_Win | Zone i ECW Solar Radiation | W/m² |
| Win | Zone i ECW Status | Status |
| T | Zone i Temperature | °C |
| H | Zone i Heating Temperature | °C |
Table 2. Model parameters and their values.

| Model | Parameter | Value |
|---|---|---|
| Building | Ceiling Height (m) | 2.44 |
| | Total Building Area (m²) | 920.22 |
| | Total Building Volume (m³) | 2271.46 |
| HVAC | Coil Cooling Rated High Speed COP | 3.0 |
| | Coil Cooling Rated Low Speed COP | 3.0 |
| | Fan Total Efficiency | 0.60 |
| | Coil Heating Efficiency | 1.0 |
| ECW | U-Factor (W/m²·K) | 0.2 |
| | Solar Heat Gain Coefficient (off) | 0.45 |
| | Solar Heat Gain Coefficient (level 1) | 0.41 |
| | Solar Heat Gain Coefficient (level 2) | 0.35 |
| | Solar Heat Gain Coefficient (level 3) | 0.22 |
| | Solar Heat Gain Coefficient (level 4) | 0.17 |
| | Solar Heat Gain Coefficient (level 5) | 0.12 |
Table 3. Performance comparison between baseline and strong benchmark (β = 5).

| Algorithm | ECW | HVAC | TCVR (%) | Energy Efficiency (%) | Temperature Offset (°C) | Electricity (kWh) |
|---|---|---|---|---|---|---|
| Baseline | off | on | – | – | – | 57,670 |
| DQN | off | on | 0.60 | 20.16 | 0.59 | 46,044 |
| QMIX | on | on | 0.12 | 40.28 | 0.35 | 34,495 |
Table 4. Performance comparison of different MADRL algorithms (β = 5).

| Algorithm | TCVR (%) | Energy Efficiency (%) | Temperature Offset (°C) | HVAC Electricity (kWh) |
|---|---|---|---|---|
| MADQN | 1.76 | 38.40 | 0.34 | 35,527 |
| MAPPO | 0.26 | 35.99 | 0.39 | 36,915 |
| QMIX | 0.12 | 40.28 | 0.35 | 34,495 |
| VDN | 0.30 | 39.81 | 0.50 | 34,708 |
| DQN | 0.50 | 20.42 | 1.87 | 45,894 |
| Baseline | – | – | – | 57,670 |
Table 5. Test results for selected locations (β = 5).

| Location | Energy Efficiency (%) | TCVR (%) | Temperature Offset (°C) | HVAC Electricity (kWh) |
|---|---|---|---|---|
| CHN_SH_Lang.Gang | 40.28 | 0.12 | 0.51 | 34,495 |
| HKG_HKI_Hong.Kong | 21.01 | 0.07 | 0.29 | 49,257 |
| CHN_GD_Guangzhou | 24.81 | 0.11 | 0.29 | 47,332 |
| CHN_SC_Chengdu | 48.21 | 0.13 | 0.44 | 29,180 |
| CHN_BJ_Beijing | 58.12 | 0.98 | 0.78 | 37,253 |
| USA_CA_Los.Angeles | 44.37 | 0.23 | 0.72 | 27,787 |
| SGP_Singapore | 16.58 | 0.00 | 0.00 | 69,059 |
Table 6. Performance comparison of different time intervals (β = 5).

| Algorithm | Time Interval | TCVR (%) | Energy Efficiency (%) | Temperature Offset (°C) | Electricity (kWh) |
|---|---|---|---|---|---|
| QMIX | 1 h | 0.12 | 40.28 | 0.35 | 34,495 |
| QMIX | 20 min | 0.13 | 41.08 | 0.36 | 34,033 |
| QMIX | 5 min | 0.09 | 39.78 | 0.30 | 40,560 |
