1. Introduction
To achieve the German government’s goal of a climate-neutral country by 2045, the scenario presented by Prognos [
1] shows a need for a total of 14 million heat pumps in the building sector. This electrification of the heat supply of buildings is often combined with the installation of a photovoltaic (PV) system on the roof to produce renewable energy on-site [
2]. Both the grid consumption of the heat pump and the grid feed-in of the PV system will place loads on the electricity grid that it was often not designed for. However, the need for grid extension can be mitigated by matching electricity demand with on-site generation and by peak shaving, both of which reduce the load on the grid. The thermal inertia of the building structure and the thermal storage provide flexibility in the operation of modulating heat pumps by temporally decoupling heat consumption from heat production. Using this flexibility can provide a cheaper alternative to battery storage or additional flexibility alongside it [
3].
Besides rule-based control strategies, more advanced approaches using model predictive control (MPC) for PV-optimized control of heat pumps have been developed in the past, e.g., Kuboth et al. [
4]. However, they require high quality forecasts [
5] and cannot be easily implemented due to the high effort of designing and parameterizing each energy system model required for the variety of existing buildings [
6].
To overcome these drawbacks, artificial intelligence (AI) approaches have become increasingly popular for building energy-system control in recent years. In their review article [
7], Alanne and Sierla discuss the learning ability of buildings at a system level and give an overview of machine-learning (ML) applications for building energy management. A major application of ML is the control of heating, ventilation, and air conditioning (HVAC) systems.
One of the three types of machine learning that is used for system control is reinforcement learning (RL) [
8]. The so-called RL agent learns an optimal control strategy by interacting with the building energy system (environment). It can also learn preferences, such as comfort requirements, by interacting with users. RL can be model-free, i.e., unlike the mixed-integer (non)linear programming (MI(N)LP) approaches most commonly used for MPC, it does not require a model of the building energy system. Forecasts for weather and heat/electricity demand are also not required but can be incorporated and will improve the results [
9,
10,
11,
12].
(Deep) reinforcement learning (D)RL methods have been investigated for the control of a variety of energy (sub)systems in residential buildings. While Yu et al. [
10] want to “…raise the attention of SBEM [smart building energy management; author’s note] research community to explore and exploit DRL, as another alternative or even a better solution for SBEM.”, Wu et al. [
13] note that reinforcement learning algorithms “…could be promising candidates for home energy management…”.
As the statistics in several publications [
9,
12,
14,
15,
16] show, the number of publications on (deep) reinforcement learning for building energy-system control has been increasing significantly since approximately 2013.
Vázquez-Canteli and Nagy [
9] collected, analyzed, and finally selected 105 articles focusing on RL for demand response (DR), maximizing human comfort, and reducing (peak) energy consumption. They classified the publications according to the type of energy system, RL algorithm, and objective, and analyzed whether the approach used a single agent or multiple agents. Among the four major groups of energy systems identified, HVAC and domestic hot water (DHW) systems were found to have significant demand-response potential. However, many publications on HVAC systems only examine the reduction in thermal-energy demand but not the (flexible) heat production. An energy system that includes an air source heat pump, thermal storage, and a PV system, as described in Section 2, is not examined in any of these publications.
Mason and Grijalva [
17] conducted a comprehensive review of RL for autonomous building energy management (BEM), including HVAC systems, water heaters, and home management systems (HMS) that control a variety of combinations of different appliances or energy subsystems. They identified Q-learning as the most commonly used RL method but also a trend towards DRL algorithms. Due to the trial-and-error nature of RL agents, directly applying them to control a real building energy system would result in unacceptably high initial energy costs and occupant discomfort. Therefore, the authors recommend pre-training the RL agent in simulation.
Wang and Hong [
14] provide very detailed statistics on 77 publications that investigate RL for building control. A total of 35.5% of the listed publications have HVAC systems as the subject of control, while only 7.3% focus on thermal energy storage (TES). The former were mostly published between 2015 and 2019, while there is no trend for the latter. Most of the RL methods used are value-based (79.3%); only 12.2% use actor–critic algorithms. None of the reviewed papers consider TES in combination with actor–critic algorithms, which are recommended by Fu et al. [
15] for continuous action-space problems. Fu et al. give an overview of (D)RL methods and their ability to solve specific problems in the field of building control, e.g., problems with small/large discrete or continuous action spaces, and present statistics on the algorithms used in the reviewed literature. However, Wang and Hong [
14] conclude that cross-study comparisons are difficult, if not impossible, due to the different environmental settings and benchmarks used in different investigations. Standardized control problems and integrated software tools that combine machine-learning capabilities and building simulation are proposed as a way to overcome this drawback but are currently lacking.
Perera and Kamalaruban [
16] categorized publications based on seven areas of application of RL approaches, including building energy-management systems. They conclude, among other things, that the lack of use of deep-learning techniques and state-of-the-art actor–critic methods hinders performance improvements.
Yu et al. [
11] also provide a comprehensive review of publications on deep reinforcement learning (DRL) for smart building energy management (SBEM), distinguishing between studies on single building energy subsystems, multi-energy subsystems in buildings, and microgrids. Only three publications on single building energy subsystems examine electric water heaters, and another three examine HVAC systems with the primary goal of reducing energy costs or consumption. The rest do not examine residential buildings. Only one of the publications on multi-energy subsystems in buildings [
18] examines a system similar to the one used in this paper (
Section 2.1). The differences from the system investigated here are the use of a gas boiler as an additional backup system and the use of energy costs as the objective.
Shaqour and Hagishima [
12] analyzed a database of 470 publications on DRL-based building energy-management systems (BEMS) for different building types and reviewed recent advances, research directions, and knowledge gaps. Their selection resulted in 31 publications focused on residential buildings, of which only three used thermal energy storage and investigated RL approaches on a single-family home or district level. Zsembinszki et al. [
19] investigated a very complex building energy system for a single-family house, using two water storage tanks, a phase-change-material storage tank, and a battery storage to provide flexibility to the energy system. Glatt et al. [
20] developed an energy-management system for controlling energy storages of individual buildings in a microgrid using a decentralized actor–critic reinforcement learning algorithm and a centralized critic. Kathirgamanathan et al. [
21] designed and tuned an RL algorithm to flatten and smooth the aggregated electricity demand curve of a building district. The latter two used CityLearn as a simulation environment.
Only two publications [
5,
22] were found that investigate a system configuration identical to the one used in this paper. Ludolfinger et al. [
22] developed a novel recurrent soft actor–critic (RSAC) algorithm for energy-cost-optimized heat pump control and compared it with four RL-based and one simple rule-based approach. However, their study refers to a single-family house, and the training and simulation of the RL agent were performed only for winter periods of two weeks and one week, respectively. The models were developed in Modelica, and the RL agent was developed in Python. Langer and Volling [
5] transformed a mixed-integer linear program (MILP) into a DRL implementation and compared the self-sufficiency achieved by a deep deterministic policy gradient (DDPG) algorithm, a model predictive controller (MPC) under full information (theoretical optimum), and a rule-based approach. However, the buildings considered in Ye et al. [18] and Langer and Volling [5] are also very energy-efficient single-family houses with correspondingly small system dimensions. Their investigations are based on simplified linear models developed in previous MILP studies with time-step sizes of one hour, in which mass flows and temperatures in hydraulic circuits and thermal storages are not considered in detail.
Most of the publications analyzed in the literature review focus on energy-efficient, single-family buildings and are often based on simplified linearized models originally developed for MILP approaches with hourly time resolution. There are no studies on new or renovated multi-family buildings with an energy system including an air source heat pump, thermal storage, and a PV system.
The novelty of this work is, therefore, the investigation of a DRL-based approach (i) to control the energy system of a multi-family building and (ii) to do so using a physical model that takes mass flows and temperatures in hydraulic circuits and thermal storages into account in detail, requiring a variable time-step solver due to the complexity of the model. This approach differs from published work using constant time steps, usually of one hour, which is too coarse for real heat pump control. For the first time, the MATLAB software environment [23] is used with three toolboxes, namely Simulink [24] as the development environment, CARNOT [25] for building energy-system modeling, and the Reinforcement Learning Toolbox [26], to design, train, and simulate a (D)RL-based controller for PV-optimized control of an air source heat pump. The purpose of this paper is to show how four selected DRL algorithms perform in this context by evaluating two objectives and comparing them with rule-based controllers. The first objective is to maintain only the required temperatures in the thermal storage. The second objective is to additionally improve the PV self-consumption and self-sufficiency by shifting heat pump operation to times of PV surplus.
This paper is structured as follows. In
Section 2, the methodology of designing the simulation environment and the controllers is described. The applied building energy system model is introduced, two rule-based control strategies used as a reference are described, and the development of two RL agents, one without and one with PV-optimization, is shown. The training and selection of an RL-based controller as well as the results of annual simulations using (i) the reference rule-based controllers and (ii) the developed RL-based controllers are presented and discussed in
Section 3.
Section 4 concludes the paper and gives an outlook.
2. Methodology
The methodology used for this work is shown in
Figure 1.
To set up the simulation environment, data and parameters for the building and energy-system model were collected, including load and occupancy profiles and weather data. Based on this data, the sub-models for the building, the thermal storage, the heat pump, and the PV system were created and parameterized (
Section 2.1). Because mass flows and temperatures in the hydraulic circuits and the thermal storage are modeled explicitly, the complexity of the model and of its dynamics, especially with respect to system control, is higher than in simplified linearized models. Resolving these dynamics requires very short time steps; therefore, a variable time-step solver was used, which limits the computational effort by shortening the steps only when necessary.
Two rule-based controllers were defined for reference (
Section 2.2). The development of the two RL-based controllers (
Section 2.3) was an iterative process. First, suitable RL methods were selected, followed by the selection of the observations, the definition of the reward function, and the setting of the hyperparameters and the training parameters. The next step was the training of the RL agents and the analysis of the training results. Depending on the success of the training, the observations, reward function, hyperparameters, and training parameters were adjusted iteratively until the training results were satisfactory. Finally, the two rule-based and the two RL-based controllers were used in an annual simulation. The two PV-optimized controllers were compared with each other using the performance indicators PV self-consumption and self-sufficiency (
Section 3).
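For reference, the two indicators follow the commonly used definitions below (a standard formulation; the exact expressions evaluated in Section 3 may differ in detail), where E_PV,self is the PV electricity consumed on-site by the households and the heat pump, E_PV,gen the total PV generation, and E_el,cons the total electricity consumption of households and heat pump:

```latex
\text{self-consumption} = \frac{E_{\mathrm{PV,self}}}{E_{\mathrm{PV,gen}}},
\qquad
\text{self-sufficiency} = \frac{E_{\mathrm{PV,self}}}{E_{\mathrm{el,cons}}}
```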
2.1. Building Energy-System Model
Multi-family buildings account for approximately 41% of the living space in Germany [
27] and, therefore, have a large potential for energy savings and flexibility in energy supply. With 19% (225 million m²) of the total living space of multi-family buildings in Germany (1168 million m²), building age category E of the IWU building typology of Germany [27] is the largest category for multi-family buildings.
A study on multi-family buildings [
2] investigated low-energy concepts for energy renovations of multi-family buildings and developed, analyzed, and demonstrated solutions for the efficient use of heat pumps, heat-transfer systems, and ventilation systems. It shows that heat-supply temperatures can be reduced to 60 °C or less by replacing only a few radiators in critical rooms (e.g., 2–7% of the existing radiators in the cases studied), which makes heat pumps a suitable heat-supply solution in such buildings. Additionally, the grid electricity consumption of such buildings can be reduced by using electricity from an on-site PV system. Based on these results, the conventionally renovated variant of building age category E was selected for the investigations.
MATLAB [
23] and the toolboxes Simulink [
24] and CARNOT [
25] were used to model the investigated building with its heat and electricity demand and the energy system consisting of an air source heat pump (hp), a thermal storage, and a PV system as shown in
Figure 2.
The parameters (geometry, U-values, etc.) of the multi-family building, with a heated living space of approximately 2850 m² divided into 32 apartments, were defined according to the data of Loga et al. [
27]. To ensure hygienic air exchange, a ventilation system with a constant air change rate of 0.4 h⁻¹ is included in the building model. In line with typical user behavior, additional ventilation is assumed in summer to prevent overheating of the building. The internal heat gains were defined according to the occupancy profile for residential buildings given in the SIA guideline 2024 [
28] and scaled to 60 occupants. As input for the simulations as well as for the load profile generation, the test reference year (TRY) of Ingolstadt provided by the German Weather Service [
29] was used. For the generation of the hourly household electricity consumption profile (also used for internal heat gains of electrical appliances) and the domestic hot water (DHW) consumption profile, the method described in the VDI guideline 4655 [
30] was used. The annual household electricity consumption was set at 96,000 kWh/a, and the annual DHW consumption was set at 61,500 kWh/a. A stratified thermal storage tank (total volume 8.5 m³) is modeled as a hydraulic separator between the heat pump hydraulic circuit and the consumer hydraulic circuit. It decouples the thermal energy consumption from the heat production, thus providing some flexibility in the operation of the heat pump. A south-facing PV system of 65 kWp is modeled to partially cover the electricity demand of the households and the heat pump. Even in the ideal case of 100% PV self-consumption, two-thirds of the electricity consumed must come from the grid [
31]. The dimension of the PV system was derived from an existing building. The building energy-system parameters are summarized in
Table 1.
2.2. Reference Control Strategies
Two rule-based control strategies were defined and used as references for the comparison with the RL-based control strategies.
When PV electricity generation is not available or not considered, heat pumps combined with thermal storage have traditionally been operated in an on/off mode depending on the temperatures in the thermal storage (basic controller). In the case shown here, the heat pump is switched on when the top storage temperature drops below 55 °C and runs until the bottom storage temperature reaches that value.
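For illustration, the following minimal MATLAB sketch shows this hysteresis logic; the function name, signature, and state handling are assumptions and not the authors' implementation.

```matlab
% Basic on/off (hysteresis) controller for the heat pump, as described above.
% hpOn carries the on/off state between calls; u is the control signal.
function [u, hpOn] = basicController(T_top, T_bottom, hpOn)
    T_set = 55;                    % storage set temperature in degC
    if ~hpOn && T_top < T_set
        hpOn = true;               % top of storage too cold -> switch heat pump on
    elseif hpOn && T_bottom >= T_set
        hpOn = false;              % storage fully charged -> switch heat pump off
    end
    u = double(hpOn);              % control signal: 1 = on, 0 = off
end
```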
There are several rule-based approaches to better match the heat pump electricity consumption and PV generation, often taking advantage of the modulation of the heat pump. The controller used here (PV-optimized controller) is designed to use most of the PV electricity that is not directly used in the households (PV surplus). If there is no PV surplus, the controller only runs the heat pump to keep the top storage temperature at 55 °C to ensure the heat supply. If the PV surplus exceeds the limit defined by the electricity consumption of the heat pump at the lowest modulation level (27 kWel at 30%), the heat pump is switched on and modulated to consume exactly the PV surplus as long as the bottom storage temperature is lower than or equal to 55 °C. This approach is shown in
Figure 3.
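A minimal MATLAB sketch of this rule set is given below; it is an illustrative simplification, not the authors' implementation, and P_hp_nom (the nominal electric power of the heat pump) as well as the fallback behavior without PV surplus are assumptions.

```matlab
% Rule-based PV-optimized controller as described above (simplified sketch).
% P_pv_surplus: PV surplus in kW; P_hp_nom: nominal electric power of the heat pump.
function u = pvOptimizedController(T_top, T_bottom, P_pv_surplus, P_hp_nom)
    T_set = 55;                                    % storage temperature limit in degC
    u_min = 0.3;                                   % lowest modulation level
    if P_pv_surplus >= u_min*P_hp_nom && T_bottom <= T_set
        % modulate the heat pump so that it consumes exactly the PV surplus
        u = min(P_pv_surplus/P_hp_nom, 1);
    elseif T_top < T_set
        % no usable PV surplus: run the heat pump only to ensure the heat supply
        u = 1;                                     % simplification; the real controller
                                                   % keeps T_top at 55 degC
    else
        u = 0;                                     % heat pump off
    end
end
```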
2.3. Development of Reinforcement Learning-Based Heat Pump Controller
According to the RL Toolbox User’s Guide [
32], the workflow for setting up an RL-based controller involves six steps, as shown in
Figure 4.
First, the problem to be solved by the RL agent must be defined (step 1). Two objectives are considered in this study. For the first objective, the RL agent controls the heat pump to ensure a temperature of at least 55 °C at the top of the thermal storage to supply heat to the consumers, and to keep the temperature at the bottom of the thermal storage below 55 °C, due to the operating limits of the heat pump, and above 45 °C. The action of the RL agent is the control signal to the modulating heat pump in a continuous range between 0 and 1, where values lower than 0.3 switch off the heat pump. The actions are based on the observations of the top (T_top) and bottom storage temperature (T_bottom), the thermal load for space heating and domestic hot water preparation, and the absolute electricity exchange with the grid (P_sum). The second objective extends the first by a PV optimization. The storage temperatures should be kept within the previously defined range, while the heat pump should run at times when the PV electricity generation exceeds the household electricity consumption. Therefore, the observations are extended with information such as the PV surplus (P_PV,surp), the PV generation including historical data and forecasts, and the time of day and day of the year, to allow the RL agent to learn diurnal and seasonal correlations.
The building energy-system model described in
Section 2.1 defines the environment (step 2) with which the agent interacts by giving actions and receiving observations and the reward shown next.
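The following sketch illustrates how such an environment could be set up with the Reinforcement Learning Toolbox. The observation count nObs, the model name 'energySystem', the block name 'RL Agent', and the workspace variable 'dayIndex' are placeholders for illustration only, not the authors' implementation.

```matlab
% Observation and action specifications for the PV-optimized agent.
nObs    = 10;                                      % assumed number of observations
obsInfo = rlNumericSpec([nObs 1]);                 % T_top, T_bottom, thermal load, P_sum,
                                                   % P_PV,surp, PV history/forecast, time features
actInfo = rlNumericSpec([1 1], 'LowerLimit', 0, 'UpperLimit', 1);  % heat pump control signal

mdl = 'energySystem';                              % hypothetical Simulink/CARNOT model name
env = rlSimulinkEnv(mdl, [mdl '/RL Agent'], obsInfo, actInfo);

% Reset to a randomly chosen training day at the start of each episode
% ('dayIndex' is a hypothetical workspace variable of the model).
env.ResetFcn = @(in) setVariable(in, 'dayIndex', randi(6));
```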
The following two reward functions were defined (step 3) for the two objectives mentioned above. Equations (1)–(3) were designed only to keep the storage temperatures within the defined range.
If the temperature at the top or bottom of the thermal storage becomes too low, or the temperature at the bottom becomes too high, a negative reward is given to discourage such conditions.
Equations (4)–(7) were designed to additionally shift the heat pump operation to times of PV surplus (P_PV,surp). The reward r_t for keeping the storage temperatures within the defined range was slightly adjusted compared to Equations (1)–(3) because this led to better learning during training.
In Equation (6), a is the action value (control signal for the heat pump) chosen by the RL agent. In the ideal case, there is no electricity consumption from the grid and no feed-in into the grid, which results in a positive reward of 5. If there is no PV surplus, the action value a is multiplied by a negative reward of −100 to discourage the agent from choosing high action values that switch on the heat pump and charge the thermal storage. In this way, the state of charge of the thermal storage is kept low until it can be charged using the PV surplus, which reduces the grid feed-in. This reduces the amount of power exchanged with the grid and thereby increases the PV self-consumption rate.
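The following MATLAB sketch is an illustrative reconstruction of this reward logic from the verbal description only; it is not the authors' Equations (4)–(7), and the temperature penalty magnitude and the grid-exchange tolerance are assumptions.

```matlab
% Illustrative reward for the PV-optimization objective (not the published equations).
function r = pvReward(T_top, T_bottom, P_sum, P_pv_surplus, a)
    % Temperature term: penalize leaving the allowed storage temperature range
    % (penalty magnitude is an assumption).
    r = 0;
    if T_top < 55 || T_bottom < 45 || T_bottom > 55
        r = r - 10;
    end
    % PV term as described for Equation (6):
    if P_pv_surplus > 0
        if abs(P_sum) < 1              % (almost) no exchange with the grid (tolerance assumed)
            r = r + 5;
        end
    else
        r = r - 100*a;                 % discourage heat pump operation without PV surplus
    end
end
```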
After the problem is defined, the RL agent can be created (step 4). The RL agent requires a fixed time-step size, which was set to 5 min. Four algorithms were used to train an RL agent, as listed below (an agent-creation sketch follows the list):
Soft Actor–Critic (SAC);
Twin-Delayed Deep Deterministic (TD3);
Proximal Policy Optimization (PPO);
Trust Region Policy Optimization (TRPO).
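The sketch below shows how the four candidate agents could be created from the same observation and action specifications using the Toolbox default networks; the actual network architectures and hyperparameters were tuned iteratively in this work and are not reproduced here, so all values shown are assumptions.

```matlab
% Create the four candidate agents with default actor/critic networks.
Ts = 300;                                    % agent time step: 5 min in seconds
agents = {rlSACAgent(obsInfo, actInfo), ...  % Soft Actor-Critic
          rlTD3Agent(obsInfo, actInfo), ...  % Twin-Delayed Deep Deterministic
          rlPPOAgent(obsInfo, actInfo), ...  % Proximal Policy Optimization
          rlTRPOAgent(obsInfo, actInfo)};    % Trust Region Policy Optimization
for k = 1:numel(agents)
    agents{k}.AgentOptions.SampleTime = Ts;  % fixed time-step size of the agent
end
```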
Based on the results (shown in
Section 3.2.1), the algorithm used for validation in an annual simulation (step 6,
Section 3.2) was selected for both objectives.
Due to the high computational effort of training an agent (step 5) for an entire year, it was decided to train the agent only for an episode length of one day, randomly chosen from a predefined set of days distributed throughout the year, to avoid overfitting but still represent the characteristics of an entire year. Based on experiences from previous studies [
31], a backup controller was implemented in the model. It intervenes and overrides the control signal given by the RL agent when the storage temperatures violate the defined temperature range. This allows the agent to explore actions that lead to unintended temperature conditions and to learn to avoid them because of the negative reward received. Specifically, the backup controller switches the heat pump on if the top temperature falls below 55 °C or switches the heat pump off if the bottom temperature exceeds 57 °C. In both cases, a negative or zero reward is given to the RL agent so that it learns to avoid such conditions.
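Continuing the sketches above, a minimal training setup consistent with this description could look as follows; the episode budget, averaging window, and plot option are assumptions, while 288 steps per episode corresponds to one day at the 5 min agent time step.

```matlab
% Training setup (step 5) and a validation run; values are illustrative only.
trainOpts = rlTrainingOptions( ...
    'MaxEpisodes',                2000, ...            % assumed training budget
    'MaxStepsPerEpisode',         288, ...             % one-day episode at a 5 min time step
    'ScoreAveragingWindowLength', 20, ...
    'Plots',                      'training-progress');

trainStats = train(agents{1}, env, trainOpts);         % e.g., train the SAC candidate
simOut     = sim(env, agents{1});                      % validate the trained policy in the model
```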
4. Conclusions and Outlook
The detailed literature review conducted in this paper revealed a lack of investigation of RL-based controllers for multi-family building energy systems, including an air source heat pump, thermal storage, and a PV system. This gap is filled by this paper, which presents a novel investigation of RL-based controllers using a physical model that takes into account mass flows and temperatures in hydraulic circuits and thermal storages in detail. The model introduced for the development and simulation of RL-based controllers requires a variable time-step solver due to the complexity of the model. This is also a novelty compared to the related work using simplified models with constant time steps.
To evaluate the performance of RL-based controllers in this context, two rule-based controllers were defined and simulated for reference. One considers only the required temperatures in the thermal storage, and the other also optimizes the operation of the heat pump by shifting it to times of PV surplus to increase the self-consumption and self-sufficiency.
Four RL algorithms were investigated for each of the two RL-based controllers, with the same two objectives as the reference controllers. For a successful training of the RL agents, a backup controller had to be implemented in the model to ensure the required storage temperatures. This also made it possible for the agent to explore bad choices resulting in negative rewards. Finally, the soft actor–critic method was selected for the annual simulations because it achieved the highest rewards and converged during training. It was shown that training a policy on only six defined days of the year results in an RL agent that successfully controls the heat pump to maintain the required storage temperatures for an entire year without intervention of the backup controller. However, the annual simulations of the RL-based PV optimization of the heat pump operation achieved only a self-consumption rate of 50.3% and a self-sufficiency of 17.6%, lower than the rule-based PV-optimized controller (55.2% and 19.2%, respectively), while the backup controller had to intervene for 1.415 h per year. Thus, the trade-off between keeping the storage temperatures in the required range and shifting the heat pump operation to times of PV surplus appears to be challenging.
Due to the seasonality of the PV electricity generation (maximum in summer), which is opposite to that of the heat demand of the building (maximum in winter), it is suggested to extend the training episode length to at least several days or a few weeks per season (winter, transition periods, and summer) in order to cover longer time periods with larger variations in the ratio between heat load and PV electricity generation than the six training days used here provide. For this purpose, long short-term memory (LSTM) layers should be implemented in the policy to learn decisions based on both short-term and long-term experience. Future work to optimize this RL agent should also include a redesign of the reward function and further investigation of the hyperparameters.
Guidelines or suggestions for choosing suitable hyperparameter values for the different RL algorithms and for their application to specific problems are lacking in the literature. For future developments and a wider application of (D)RL approaches, it would be very beneficial to collect and analyze hyperparameter settings in order to provide such guidelines for developers.