Review

Reinforcement Learning for Optimizing Renewable Energy Utilization in Buildings: A Review on Applications and Innovations

by Panagiotis Michailidis 1,2,*,†, Iakovos Michailidis 1,2,*,† and Elias Kosmatopoulos 1,2,†

1 Center for Research and Technology Hellas, 57001 Thessaloniki, Greece
2 Department of Electrical and Computer Engineering, Democritus University of Thrace, 67100 Xanthi, Greece
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Energies 2025, 18(7), 1724; https://doi.org/10.3390/en18071724
Submission received: 18 February 2025 / Revised: 25 March 2025 / Accepted: 27 March 2025 / Published: 30 March 2025
(This article belongs to the Special Issue New Insights into Hybrid Renewable Energy Systems in Buildings)

Abstract:
The integration of renewable energy systems into modern buildings is essential for enhancing energy efficiency, reducing carbon footprints, and advancing intelligent energy management. However, optimizing RES operations within building energy management systems introduces significant complexity, requiring advanced control strategies. One significant branch of modern control algorithms concerns reinforcement learning, a data-driven strategy capable of dynamically managing renewable energy sources and other energy subsystems under uncertainty and real-time constraints. The current review systematically examines RL-based control strategies applied in BEMS frameworks integrating RES technologies between 2015 and 2025, classifying them by algorithmic approach and evaluating the role of multi-agent and hybrid methods in improving real-time adaptability and occupant comfort. Following a thorough explanation of the rigorous selection process, which targeted the most impactful peer-reviewed publications from the last decade, the paper presents the mathematical concepts of RL and multi-agent RL, along with detailed summaries and summary tables of the integrated works to facilitate quick reference to key findings. For evaluation, the paper examines and outlines the different attributes in the field considering the following: methodologies of RL; agent types; value-action networks; reward functions; baseline control approaches; RES types; BEMS types; and building typologies. Grounded on the findings presented in the evaluation section, the paper offers a structured synthesis of emerging research trends and future directions, identifying the strengths and limitations of RL in energy management.

1. Introduction

1.1. Motivation

Energy efficiency has become a global priority, driven by rising energy costs, environmental concerns, and the urgent need to curb greenhouse gas emissions [1,2]. Buildings, among the largest energy consumers worldwide, are responsible for approximately 34% of global energy use and carbon emissions, particularly associated with urban areas [3]. Given their significant environmental impact and operational costs, improving building energy efficiency has become a cornerstone of sustainability efforts. A key strategy in achieving this goal concerns the integration of renewable energy systems (RES), such as solar photovoltaic panels (PVs) and wind turbines (WTs), within building infrastructures [4]. By leveraging on-site renewable energy sources, buildings may decrease reliance on conventional power grids [5], reduce energy costs [6], and contribute to broader sustainability objectives [7].
The concept of integrating RES into buildings gained traction in the late 20th century, fundamentally reshaping how energy is generated and managed in the built environment. Early initiatives primarily focused on solar thermal and PV systems, which initially emerged in pilot projects during the 1970s in response to the global oil crisis [8,9]. As renewable technologies advanced and the environmental consequences of fossil fuels became more apparent, governments and institutions worldwide increased their investment in RES for buildings [10]. By the 1990s, improvements in PV efficiency and cost reductions enabled more widespread adoption, which spurred early efforts to promote renewable energy integration across residential and commercial sectors [9,11]. In more recent years, advanced policies, such as the European Union’s Green Deal [12], played a pivotal role in promoting RES integration by setting ambitious climate targets and providing financial mechanisms like the REPowerEU [13], Renewable Energy Directive [14], and LIFE Programme [15] to support regions transitioning from fossil fuels. As is evident, such initiatives inevitably paved the way for the broader adoption of RES in buildings by providing a strong regulatory framework, financial support, and streamlined integration pathways. As RES penetration increased, the role of building energy management systems (BEMS) became indispensable in optimizing energy consumption, demand response (DR), and grid interaction.
While integrating RES in buildings offers significant benefits, including reducing reliance on conventional energy, lowering costs, and enhancing sustainability, it also poses significant challenges, including policy constraints, high costs, maintenance demands, installation disruptions, and the need for increased energy flows optimization, demand forecasting, and system management, alongside stability concerns associated with renewable energy fluctuations, grid integration, and real-time control [16,17,18]. Unlike conventional energy systems, RES are inherently variable, with outputs fluctuating based on external factors such as weather conditions and time of day. For example, solar energy generation is directly affected by sunlight availability, which varies with daily and seasonal cycles. This variability complicates energy management, requiring buildings to adapt their energy consumption patterns to align with high RES output periods while relying on storage solutions or grid support during low-production intervals [19,20]. Moreover, additional dynamic factors, including occupancy levels, energy pricing, and climate conditions, further increase the complexity of managing RES-equipped buildings [21]. Traditional rule-based control (RBC) approaches often fall short in addressing these complexities, as they lack the flexibility and adaptability required to respond effectively to fluctuating conditions [22,23,24]. Consequently, there is a growing need for advanced, intelligent control methodologies that can optimize energy use and enhance efficiency in RES-integrated building environments.
To overcome these challenges, Artificial Intelligence (AI) and machine learning-based control frameworks have emerged as transformative tools in BEMS [25,26,27,28]. Such intelligent systems can dynamically adjust to real-time fluctuations in occupancy, RES output, and energy prices, optimizing the operation of integrated energy subsystems, such as heating, ventilation, and air conditioning (HVAC), lighting systems (LS), and energy storage systems (ESS) or thermal storage systems (TSS) [29,30]. BEMS control strategies typically fall into two primary categories: model-based and model-free approaches [31]. Model-based methods rely on predefined mathematical representations to simulate system behavior, enabling predictive and precise control [32,33,34]. The efficiency of such frameworks is highly dependent on the accuracy of these models, making them complex and often expensive to implement, particularly in highly variable systems with uncertain parameters. On the other hand, model-free approaches do not require prior system modeling; instead, they learn optimal control strategies by interacting directly with the environment [31,35,36,37].
According to the literature, reinforcement learning (RL) stands out as one of the most prominent model-free approaches for managing building energy systems, including HVAC systems and similar frameworks (see Figure 1) [38]. In model-free RL, an agent learns optimal control policies through trial and error, continuously refining its strategy by exploring various actions and maximizing rewards [39,40,41,42]. The application of RL in managing RES within buildings dates back to the early 2000s, when researchers began leveraging its adaptive learning capabilities to address the inherent variability in RES output. Since then, a significant number of studies have demonstrated RL’s effectiveness in dynamically optimizing building energy management, particularly in HVAC operations under fluctuating environmental conditions [43,44,45].
Over the past decade, RL has firmly established itself as a promising model-free control strategy, tackling the complexities of RES integration in smart buildings and paving the way for more sophisticated applications in the years ahead. Its growing prominence in building energy management is largely attributed to its adaptability and capacity to handle the stochastic nature of RES [46,47]. Unlike model-based control methods, RL-based approaches can dynamically adjust to real-time system and environmental variations, making them particularly well suited for highly variable energy systems, such as those incorporating RES. A key advantage of RL lies in its ability to learn directly from interactions with the environment, refining its control policy over time without requiring a predefined system model—a feature that has proven especially valuable for complex energy management scenarios [47]. To this end, over the past years, the RL approach has demonstrated its transformative potential in building energy efficiency by optimizing energy consumption, reducing operational costs, and improving occupant comfort [48]. Recent advancements in algorithmic design, function approximation, and real-time adaptability have significantly enhanced RL’s ability to manage complex building energy systems. As such, numerous studies highlight improvements in reward shaping, multi-agent coordination, and self-learning mechanisms, allowing RL-based controllers to dynamically adjust HVAC, lighting, and renewable energy storage, leading to measurable energy savings and benefits [49,50,51]. Moreover, RL has introduced a significant number of hybridized approaches for optimizing energy systems, moving beyond traditional control methods and shaping next-generation energy management solutions. For instance, in [52], scientists proposed a Deep Q-Network RL approach with fuzzified reward mechanisms for PV Maximum Power Point Tracking (MPPT), improving performance in dynamic conditions without relying on explicit system models. Moreover, in [53], researchers applied a Twin Delayed Deep Deterministic Policy Gradient to pulverized coal boiler combustion, integrating predictive modeling for real-time optimization.
Recognizing the pivotal role of RES in sustainable building practices, this review focuses specifically on RL-based control strategies for optimizing RES operation in buildings, alongside other integrated energy systems. By analyzing recent RL applications within BEMS that incorporate RES into the energy mix, this work examines high-value research contributions to highlight how RL enhances energy efficiency and operational intelligence in building environments. The analysis spans various building types, control frameworks, and both real-world and simulated implementations, offering a comprehensive perspective on the current state of RL-based control in the field. The findings of this review provide valuable insights for researchers and practitioners aiming to implement intelligent, adaptive control strategies in RES-integrated buildings, ultimately supporting the advancement of sustainable and resilient building technologies.

1.2. Literature Analysis Approach

This review aims to provide a comprehensive exploration of key studies on the application of RL for controlling RES in buildings, highlighting significant trends and critical findings in the field. By systematically analyzing research from the past decade, this work evaluates fundamental concepts, control methodologies, RL algorithms, and their specific applications in managing RES within building environments. To ensure a structured and in-depth examination, studies are categorized based on RL control approaches—value-based, policy-based, and actor-critic—along with training methodologies and different RES-integrated BEMS. Additionally, details regarding testbed characteristics are included to offer a detailed overview of the current state of RL control, ensuring a well-rounded evaluation of each selected study.
  • Article Selection: A rigorous selection process was conducted using peer-reviewed journals and conference proceedings indexed in academic databases such as Scopus and Web of Science (WoS) to ensure quality and reliability. An initial pool of over 300 papers was reviewed based on abstracts, from which the most relevant studies were shortlisted for detailed analysis. More specifically, the work followed a multi-step quality assessment process considering the following: Citation Impact (studies with at least 10 citations—excluding self-citations—were selected to ensure academic influence, verified via Scopus at the time of selection); Relevance (only papers explicitly addressing RL-based control for RES-integrated BEMS were included, excluding studies on fault detection, generic demand-side management, or isolated RES control without RL); Peer Review (only peer-reviewed journal articles and high-quality conference proceedings—IEEE, Elsevier, Springer, MDPI—were considered, excluding preprints and non-peer-reviewed reports); Rigor (selected studies had to clearly describe RL implementation, optimization objectives, and evaluation frameworks, with benchmark comparisons and experimental validation); Diversity (a balanced selection of value-based, policy-based, actor-critic, and hybrid RL methods was ensured, covering various building types—residential, commercial, university, hospital—and RES-Integrated BEMS setups).
  • Keyword Research: A comprehensive keyword analysis was performed, incorporating terms such as “Reinforcement Learning in RES for buildings”, “RL control in BEMS”, “RL-based energy management” and other specific phrases related to RES integration. This approach ensured a broad yet precise capture of the challenges and advancements in RL-based RES management.
  • Data Collection: Each publication was systematically categorized based on the RL techniques applied to RES control, the integration of additional energy systems, the specific application context, and evaluations of the advantages, limitations, and practical implications within building energy management.
  • Quality Assessment: A structured quality assessment was carried out, considering citation count, the academic contributions of authors, and the methodological rigor of each study. More specifically, the citation-based selection required more than ten citations per study as a measure of academic impact, ensuring the inclusion of studies with established influence. Citation counts, retrieved from Scopus and excluding self-citations, provided reliability. Preference was given to publications in high-impact journals and conferences. Moreover, the research background of authors in RL, RES, and BEMS was evaluated based on their publications in top-tier journals, contributions to widely adopted RL algorithms or methodological advancements in energy management, and affiliations with leading research institutions or energy-focused labs. Studies from authors with a strong track record in RL-based energy optimization were prioritized to ensure credibility.
    This evaluation helped determine the relative significance of each research work’s contribution in the field.
  • Data Synthesis: Findings were synthesized into distinct categories, enabling clear comparisons across studies and facilitating a holistic understanding of the evolving landscape of RL-based control for RES in buildings.

1.3. Previous Work

The literature presents a wealth of studies exploring the application of RL in managing various energy systems. Among these, Wang et al. [46] conducted a comprehensive review of RL-based building control strategies, emphasizing their potential to enhance energy efficiency and adaptability. The study identified key challenges, including data-intensive training, security vulnerabilities, limited real-world adoption, and poor generalization. To address these issues, the authors highlighted the importance of transfer learning and the development of open-source testbeds to advance RL controller design and benchmarking. Al et al. [45] focused specifically on RL applications for HVAC control in intelligent buildings, underscoring its effectiveness in optimizing energy consumption while maintaining occupant comfort. By comparing RL-based approaches with conventional HVAC control methods, the study demonstrated RL’s superiority in overcoming limitations such as high computational demands and a lack of adaptive learning capabilities. A significant contribution by Perera et al. [54] examined RL applications across broader energy systems, classifying existing research into seven key categories, with a strong emphasis on building energy management and dispatch optimization. The study reinforced RL’s potential to enhance energy system efficiency while also identifying critical research gaps, including the limited integration of deep learning techniques, benchmarking inconsistencies, and challenges related to reproducibility.
Gaviria et al. [55] extended the discussion to machine learning frameworks for optimizing PV system operations. Their review highlighted gaps in benchmarking, real-world validation, and the integration of advanced machine learning models. The study proposed future directions, such as leveraging RL for Maximum Power Point Tracking (MPPT) and exploring the use of transformer-based models for improved PV performance forecasting. Fu et al. [56] provided a focused review of RL applications in building energy efficiency, particularly in HVAC optimization. The study systematically classified RL algorithms, distinguishing value-based methods, e.g., Q-learning (QL), Deep-Q-Networks (DQN), for discrete action spaces and policy-based methods, e.g., DDPG, proximal policy optimization (PPO), for continuous control tasks. While the research underscored RL’s advantages over traditional control techniques, it also highlighted persistent challenges, such as prolonged training times and the high costs associated with real-world deployment.

1.4. Contribution and Novelty

This review distinguishes itself from the existing literature on RL control in RES-integrated BEMS through several key contributions. Firstly, it is among the few studies that specifically focus on RL control frameworks designed for the optimal operation of RES—including PV panels, WTs, solar water heaters (SWHs), ground heat pumps (GHPs), and biomass energy systems (BIO)—alongside a diverse set of energy management components, such as HVAC, domestic hot water (DHW), LS, ESS, and electric vehicles (EVs), within building environments.
Unlike previous reviews, this work conducts a large-scale, in-depth analysis of RL applications, systematically evaluating a substantial number of influential research studies published between 2015 and 2025. By encompassing nearly all RL-focused studies in this domain that have received more than ten citations (>10), this review ensures that only high-impact contributions inform the conclusions drawn in subsequent sections. Through the synthesis of key findings from the past decade, this study highlights the adaptability and effectiveness of RL-based control strategies in optimizing RES-integrated BEMS.
Additionally, this review presents detailed summary tables that highlight the most influential studies, enabling readers to efficiently identify relevant research and compare different RL approaches. Beyond summarizing existing works, this study provides a thorough evaluation of critical research elements, including the specific RL methodology employed, the agent-based algorithmic framework, the target optimization outputs, baseline control strategies, RES configurations, and building typologies. By adopting this comprehensive approach, the review not only identifies valuable insights and emerging trends but also lays the groundwork for future advancements in RL-based control for RES management.

1.5. Paper Structure

The structure of this paper is as follows (Figure 2): Section 1 introduces the motivation behind this review, outlines the literature analysis approach, examines previous studies, and highlights the paper’s key contributions and novelty. Section 2 provides an overview of commonly integrated RES technologies in buildings and reviews general RL-based control strategies for managing RES-integrated BEMS. Section 3 delves into the primary RL control methodologies, presenting the generalized mathematical framework. Section 4 examines the most highly cited research works from 2014 to 2024, summarizing key characteristics in tabular format for comparative analysis. Section 5 evaluates various research dimensions, comparing prevalent RL methodologies and their effectiveness. Section 6 identifies emerging trends based on the evaluation and outlines potential future directions for RL-based control in RES-integrated systems. Finally, Section 7 concludes with a summary of the key insights and findings presented in this review.

2. Renewable Energy Systems in the Building Level

2.1. Primary RES Types

RES commonly integrated into buildings encompass various technologies designed to enhance energy efficiency and sustainability. The most widely adopted options include the following [57,58]:
  • Solar Photovoltaic Systems: PV systems generate electricity by converting sunlight directly into electrical energy through rooftop or facade-mounted solar panels. Their widespread adoption is driven by decreasing costs and ease of installation.
  • Solar Water Heating Systems: Utilizing solar energy to produce heat, SWH systems are commonly used for water and space heating. They operate through solar collectors that absorb sunlight and transfer heat to a working fluid, which then supplies thermal energy to the building’s water heating or HVAC system.
  • Wind Turbine Systems: Small-scale WT can be installed near buildings to harness wind energy for electricity generation. However, their feasibility is highly dependent on local wind conditions, zoning restrictions, and structural integration.
  • Geothermal Heat Pumps: Also known as ground-source heat pumps, GHP systems leverage the earth’s stable underground temperature for heating and cooling. By circulating a heat-exchange fluid through buried pipes, they provide an energy-efficient alternative to conventional HVAC systems.
  • Biomass Energy Systems: Biomass systems convert organic materials—such as wood pellets, agricultural residues, or other bio-based fuels—into heat or electricity. They are particularly advantageous in regions with abundant biomass resources.
The adoption of RES in buildings is influenced by multiple factors, including energy demand, available space, budget constraints, climate conditions, government incentives, and grid connectivity. Residential buildings predominantly incorporate solar PV and solar thermal systems, benefiting from decreasing solar panel costs and sufficient rooftop space, enabling self-consumption and reduced carbon footprints [59]. Commercial buildings, characterized by larger roof areas and substantial energy requirements, frequently deploy solar PV and may integrate GHP systems for efficient heating and cooling solutions [60].
Public buildings, including government and municipal structures, are increasingly adopting solar PV and other RES technologies to align with sustainability policies and regulatory targets [61]. Educational institutions, such as universities and schools, are expanding PV adoption to lower operational costs while incorporating sustainability initiatives into their curricula [62]. Meanwhile, hospitals—given their high and continuous energy demands—exhibit a lower reliance on onsite RES, typically generating only about 7% of their electricity needs from renewable sources. Due to their critical power reliability requirements, hospitals often supplement energy generation with backup systems and grid connectivity [59,62].
One of the most significant parameters for RES adoption is the efficient architectural integration of such technologies in buildings, which represents a crucial aspect of sustainable urban development, ensuring both functional efficiency and aesthetic harmony [63]. For instance, building-integrated photovoltaics (BIPVs), where solar panels are seamlessly embedded into façades, roofs, and windows, exemplify how RES can be integrated without compromising architectural design [64]. Similarly, building-integrated WT, solar thermal collectors, and green roofs contribute to on-site energy generation while maintaining visual appeal. Advanced materials, such as transparent solar glass and PV shingles, further enhance the feasibility of RES integration in modern architecture [65]. Such innovations, coupled with digital design tools and parametric modeling, allow architects to optimize energy performance while preserving design flexibility.
Overall, RES adoption varies across building types, shaped by energy needs, infrastructure constraints, and economic feasibility, reinforcing the importance of tailored integration strategies for maximizing renewable energy utilization.

2.2. General Concept of Reinforcement Learning Control in BEMS

In an RL-based building energy management system, four key elements define the control process: the environment, sensors, environment updates, and reward computation. The overall operation of RL control, considering a building energy management system integrated with RES and various other subsystems (see Figure 3), may be described through the following key steps:
  • Environment: The environment represents all external factors that the RL agent does not directly control but must respond to and account for. This includes both indoor and outdoor conditions, such as weather, occupant presence, building physics, PV/wind energy production, and battery states, as well as external signals like electricity prices and demand-response requests. Additionally, it encompasses constraints imposed by the power grid or regulatory frameworks. In essence, the environment encapsulates the entire building system and its dynamic interactions with external influences over time.
  • Sensors: Sensors play a crucial role in collecting real-time data on the state of the environment, providing the RL agent with the necessary observations for decision-making. In a building context, this typically includes measurements of indoor temperature, occupancy, energy consumption, PV/wind energy output, and battery state of charge. Sensor inputs can originate from physical devices, such as temperature sensors, occupancy detectors, and power meters, or from virtual data streams, including electricity price signals and weather forecasts.
  • RL Agent: The RL agent serves as the intelligent decision-making entity, processing environmental observations and selecting optimal control actions to enhance energy efficiency, reduce costs, and maintain occupant comfort. Depending on the RL approach, the agent’s learned policy may be represented using a Q-table (in traditional RL) or an artificial neural network (ANN) in deep RL (DRL). The agent makes key operational decisions, such as adjusting HVAC setpoints, scheduling battery storage operations, managing EV charging, and optimizing renewable energy dispatch.
  • Control Decision Application: Based on the learned policy, the RL agent applies its selected control actions in real time to optimize the building’s energy management. These decisions might involve regulating HVAC settings to maintain indoor comfort while minimizing energy consumption, scheduling battery charging and discharging to maximize renewable self-consumption and reduce peak loads, or managing PV output, determining whether to store excess solar energy, use it immediately, or feed it into the grid based on real-time electricity prices and demand.
  • Environment Update: After the RL agent implements its decisions, the environment undergoes an update, reflecting the impact of the chosen actions. This includes changes in building physics, occupant behavior, weather conditions, and grid interactions, which collectively determine the new environmental state. Mathematically, this transition captures how the building and its systems evolve over time in response to both internal control strategies and external dynamics.
  • Reward Computation: Following the environment update, a numerical reward is calculated to assess the effectiveness of the RL agent’s actions in the given timestep. The reward function may consider multiple factors, such as energy cost savings, occupant comfort levels, peak load reduction, adherence to thermal constraints, and environmental impact.
  • Reward Signal Feedback: Finally, the computed reward is fed back into the RL algorithm, enabling the agent to refine its policy (whether through QL, an ANN, or another RL approach). By continuously interacting with the environment, learning from past actions, and adjusting its strategy accordingly, the RL agent progressively improves its decision-making to better achieve objectives such as minimizing costs, maintaining thermal comfort, and maximizing on-site renewable energy utilization.
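To make this loop concrete, the following minimal sketch, written in plain Python, illustrates how sensor observations, control decisions, environment updates, and reward feedback interleave over successive timesteps for a hypothetical RES-integrated BEMS; the environment dynamics, action set, and reward weighting are placeholder assumptions rather than a reference implementation.

```python
import random


class BuildingEnv:
    """Hypothetical, heavily simplified RES-integrated BEMS environment.
    State: (indoor temperature band, battery state-of-charge band, PV output band)."""

    def observe(self):
        # Sensor layer: discretized readings from temperature sensors, the battery
        # management system, and the PV inverter (random stand-ins for illustration).
        return (random.randint(0, 4), random.randint(0, 4), random.randint(0, 4))

    def step(self, action):
        # Environment update: building physics, occupants, weather, and grid signals
        # evolve after the control action (replaced here by random draws).
        next_state = self.observe()
        # Reward computation: negative weighted sum of energy cost and comfort deviation.
        energy_cost = random.uniform(0.0, 1.0)
        comfort_penalty = 0.1 * abs(next_state[0] - 2)
        return next_state, -(energy_cost + comfort_penalty)


def select_action(state, n_actions=3):
    # Placeholder for the RL agent's decision-making: a trained policy
    # (Q-table or ANN) would go here; the learning rules are covered in Section 3.
    return random.randrange(n_actions)


env = BuildingEnv()
state = env.observe()
for t in range(96):                        # e.g., one day at 15 min resolution
    action = select_action(state)          # control decision (HVAC/battery/PV dispatch)
    state, reward = env.step(action)       # environment update + reward feedback
    # in a learning setup, (state, action, reward) would now update the agent's policy
```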

3. Mathematical Framework of Reinforcement Learning

The integration of RES in buildings has become a critical strategy for reducing carbon footprints and enhancing overall energy efficiency. RL has emerged as a powerful solution for managing the inherent complexity and stochastic nature of RES, enabling dynamic optimization of energy generation, storage, and consumption. Unlike conventional model-based control methods, RL is inherently adaptive, allowing it to respond to environmental variations and unforeseen disturbances. This makes it particularly well suited for addressing the intermittency of renewable energy sources such as solar and wind power.
RL formulates the control problem as a Markov Decision Process (MDP), wherein the objective is to learn optimal policies through iterative trial-and-error interactions with the environment [66,67]. This adaptive learning framework is especially valuable in modern smart buildings, where effective RES integration requires dynamic load balancing, intelligent ESS management, and real-time responsiveness to fluctuating demand. By continuously refining its control strategies, RL enables more efficient and resilient building energy management, ensuring that renewable resources are utilized optimally in diverse and unpredictable conditions.

3.1. The General Concept of RL

The mathematical framework of RL may be summarized as follows: RL can be formalized through the MDP, defined by the tuple $(S, A, P, R, \gamma)$. Here, $S$ is the set of possible states representing the environment; $A$ is the set of actions available to the agent; $P(s' \mid s, a)$ denotes the transition probability from state $s$ to $s'$ under action $a$; $R(s, a)$ is the reward function, providing feedback based on the action taken; and $\gamma \in [0, 1]$ is the discount factor that balances immediate and future rewards.
The objective is to maximize the cumulative reward $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k}$ by learning a policy $\pi(a \mid s)$ that determines the best action to take in a given state. The optimal policy $\pi^*$ maximizes the expected return $V^{\pi}(s) = \mathbb{E}[G_t \mid s_t = s]$. The action-value function $Q^{\pi}(s, a)$ may be defined as:

$$Q^{\pi}(s, a) = \mathbb{E}[G_t \mid s_t = s, a_t = a]$$

This function is iteratively updated through algorithms such as QL, where:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

where $\alpha$ is the learning rate.

3.2. Multi-Agent Reinforcement Learning

In a multi-agent extension of RL, a set of $N$ agents interacts with the environment under the framework of a Markov game (or stochastic game). Let $s \in S$ denote the shared or partially shared state, and let $a_i \in A_i$ be the action of the $i$th agent [20,68]. The transition dynamics are given by:

$$P(s' \mid s, a_1, \ldots, a_N),$$

which specifies the probability of transitioning to state $s'$ when agents choose actions $a_1, \ldots, a_N$. In a cooperative building energy management scenario, agents often share a global reward function $R(s, a_1, \ldots, a_N)$ and collectively aim to maximize the expected return [68]. Each agent $i$ may maintain a policy $\pi_i(a_i \mid s)$, which in value-based methods is refined by approximating a state-action value function $Q_i(s, a_i)$.
In practice, these functions are often learned via ANNs called value networks (critics) and action networks (policies/actors) [69]. The value network estimates the expected return, for example, $Q_i^{\pi}(s, a_i)$, while the action network outputs the probability distribution over possible actions (or a direct action map) $\pi_i(a_i \mid s)$. Each agent updates these networks iteratively via algorithms, coordinating to optimize energy generation, storage, and consumption in the building environment [70].
The operation of value networks (critics) and action networks (policies) may be organized around three core dimensions [71]: Structure, covering how critics and policies are shared among agents; Training, describing whether learning is centralized, decentralized, or mixed; and Coordination, which ranges from fully implicit interactions to explicit communication or hierarchical control. Each dimension influences scalability, data requirements, and overall system performance under varying building and renewable integration scenarios [71,72,73]. More specifically, the subtypes of such elements may be described as follows:
Structure: Determines how value and policy networks are shared among agents, ranging from centralized (shared networks) to decentralized (individual networks), impacting coordination and scalability. They may be categorized as follows:
  • One centralized critic with separate policy networks: A single critic processes global or aggregated information, while each agent retains its own policy.
  • Fully separate critics and policies: Each agent independently learns both value functions and policies, with no shared parameters or central critic.
  • A shared critic and policy across all agents: All agents operate under the same critic and policy, effectively functioning as a unified controller.
  • Hybrid/partially shared: Certain components (e.g., parts of the critic or specific layers in the policy networks) are shared, while others remain agent-specific.
Training: Refers to the learning paradigm—centralized, decentralized, or hybrid—that dictates how agents update their networks, influencing learning efficiency and policy development. They may be categorized as follows:
  • Centralized Training, Decentralized Execution (CTDE): Agents leverage global or centralized information during learning but act independently in real-time.
  • Fully Decentralized Training: Each agent learns solely from its own local observations and experiences, with no shared critic or global oversight.
  • Fully Centralized Training: A single entity holds and updates all agent parameters, effectively treating the multi-agent environment as one large system.
  • Mixed/Hybrid Training: Combines centralized feedback with decentralized updates, or vice versa, to balance local autonomy and global coordination.
Coordination: Encompasses the mechanisms through which agents exchange information or rewards, affecting the convergence and effectiveness of their collective policies. Coordination or communication schemes among agents may be categorized as follows:
  • Implicit Coordination: Agents rely on shared objectives, rewards, or a global critic but do not communicate directly.
  • Explicit Coordination: Agents exchange information or messages, enabling direct negotiation or data sharing for coordinated decision-making.
  • Emergent Coordination: Cooperative behavior arises through repeated interactions within the environment, without any explicit mechanism or communication.
  • Hierarchical Coordination: Leader-follower roles or multi-level decision architectures establish structured control and optimize collaborative objectives.
Structurally, a design with one centralized critic means all agents share a value network, while fully separate critics means each agent trains its own—both choices also dictate whether policy (action) networks are similarly shared or independent. From a training standpoint, even if the networks are centralized or partially shared, agents may still learn under a mixed or decentralized scheme, each updating its own parameters with only limited global oversight. Finally, coordination mechanisms—implicit or explicit—determine how agents exchange information or rewards and thus whether their value or action networks converge toward cohesive multi-agent policies or evolve more independently.
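As a structural illustration of the centralized-critic, decentralized-policy arrangement discussed above, the minimal sketch below pairs per-agent policies that see only local observations with a single critic that scores the joint observation-action pair during training; the linear function approximators, agent count, and dimensions are hypothetical placeholders, and the full MARL training loop is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
N_AGENTS, OBS_DIM, N_ACTIONS = 3, 4, 2    # hypothetical sizes (e.g., HVAC, battery, PV agents)


class Actor:
    """Decentralized policy: maps only the agent's local observation to action probabilities."""

    def __init__(self):
        self.w = rng.normal(scale=0.1, size=(OBS_DIM, N_ACTIONS))

    def act(self, obs):
        logits = obs @ self.w
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return rng.choice(N_ACTIONS, p=probs)


class CentralCritic:
    """Centralized critic: scores the joint observation and joint action during training."""

    def __init__(self):
        self.w = rng.normal(scale=0.1, size=N_AGENTS * OBS_DIM + N_AGENTS * N_ACTIONS)

    def value(self, joint_obs, joint_actions):
        action_onehot = np.zeros(N_AGENTS * N_ACTIONS)
        for i, a in enumerate(joint_actions):
            action_onehot[i * N_ACTIONS + a] = 1.0
        return float(np.concatenate([joint_obs.ravel(), action_onehot]) @ self.w)


actors = [Actor() for _ in range(N_AGENTS)]
critic = CentralCritic()

joint_obs = rng.normal(size=(N_AGENTS, OBS_DIM))                           # local observations
joint_actions = [actor.act(obs) for actor, obs in zip(actors, joint_obs)]  # decentralized execution
q_joint = critic.value(joint_obs, joint_actions)                           # centralized evaluation
print(joint_actions, round(q_joint, 3))
```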

3.3. Common RL Algorithms for Energy Management

According to the literature, RL for energy systems management may be categorized into three main approaches [40]: value-based, policy-based, and actor-critic methods, each operating uniquely to optimize energy systems. More specifically, these are characterized as follows:

3.3.1. Value-Based Algorithms

Such RL methodologies focus on estimating the value of actions in a given state. To do so, a value function is utilized to guide decision-making by selecting the action with the highest expected reward. According to the literature, value-based RL approaches are particularly effective for discrete control tasks, such as determining optimal time slots for battery charging or scheduling energy loads. However, they struggle with large or continuous action spaces due to the curse of dimensionality. Common value-based algorithms found in the literature considering energy management concern QL and DQN [74]. Their mathematical conceptualization may be expressed as follows [40]:
  • Q-Learning: QL is an off-policy algorithm that learns the optimal action-value function regardless of the policy being followed. This algorithm is widely used for RES control due to its simplicity and ability to converge to the optimal policy without requiring a model of the environment [75]. However, QL struggles with high-dimensional state spaces, leading to slow convergence [76]. It is primarily applied in small-scale RES systems, such as individual building HVAC control or battery storage management. QL learns the value of state-action pairs, $Q(s, a)$, through iterative updates using the Bellman equation:
    $$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$
    where $\alpha$ is the learning rate. QL performs well in discrete action spaces and has been applied to HVAC optimization in buildings [77] (a minimal code sketch of this tabular update is provided after this list).
  • Deep Q-Networks: DQN addresses the limitations of QL by utilizing deep ANNs to approximate the Q-function, enabling the handling of complex, high-dimensional environments. DQN leverages experience replay and target networks to stabilize learning [66]. This makes DQN suitable for larger RES configurations where multi-variable control is required, such as managing distributed energy resources in commercial buildings. However, DQN is computationally intensive and requires significant tuning, making deployment in real-time systems challenging. DQN extends the QL approach by using deep ANNs to approximate $Q(s, a)$ values, enabling the handling of large state spaces [78]. The update rule may be expressed by the following equation:
    $$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-) - Q(s_t, a_t; \theta) \right]$$
    where $\theta$ denotes the parameters of the online network and $\theta^-$ those of the target network.
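The following minimal sketch, assuming a toy discrete battery-scheduling task with a made-up tariff and state encoding, shows the tabular Q-learning update from the bullet above in runnable form; it is an illustrative example, not a model of any study reviewed here.

```python
import random
from collections import defaultdict

ACTIONS = ["charge", "idle", "discharge"]        # discrete control set (illustrative)
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
Q = defaultdict(float)                           # Q(s, a) table, default value 0


def simulate_step(soc, hour, action):
    """Placeholder environment: state = (battery state-of-charge band, hour of day)."""
    price = 1.0 if 17 <= hour <= 21 else 0.3     # stand-in tariff with an evening peak
    reward = 0.0
    if action == "charge" and soc < 4:
        soc, reward = soc + 1, -price            # buying energy costs money
    elif action == "discharge" and soc > 0:
        soc, reward = soc - 1, price             # offsetting purchases saves money
    return (soc, (hour + 1) % 24), reward


state = (2, 0)
for _ in range(10_000):
    # epsilon-greedy action selection over the discrete action set
    if random.random() < EPSILON:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    next_state, reward = simulate_step(*state, action)
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    state = next_state
```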

3.3.2. Policy-Based Algorithms

In contrast to value-based approaches, the policy-based methodologies bypass the value function and directly optimize a policy to determine actions. Such approaches are well suited for continuous action spaces, enabling smooth control of solar trackers or inverters. However, while they perform better in high-dimensional action spaces, they may converge to suboptimal solutions without careful tuning of hyperparameters [40]. Common policy-based algorithms found in the literature considering energy management concern primarily the PPO algorithm. The mathematical conceptualization of the algorithm may be expressed as follows:
  • Proximal Policy Optimization (PPO): PPO entails primarily a policy-based approach because it directly optimizes the policy (the actor) by maximizing an objective function related to expected rewards. The policy is typically represented by a neural network that outputs action probabilities (for discrete actions) or parameters of a distribution (for continuous actions). PPO improves stability by limiting the policy update step, thereby preventing drastic policy changes [79]. PPO has been effectively applied to BEMS, particularly for optimizing HVAC systems and renewable energy dispatch. PPO strikes a balance between performance and computational efficiency, making it popular for large-scale applications [80]. Its clipped surrogate objective may be expressed as follows (a numerical sketch of this objective is provided after this list):
    $$L^{CLIP}(\theta) = \mathbb{E}\left[ \min\left( r_t(\theta) A_t,\; \mathrm{clip}\left(r_t(\theta), 1 - \epsilon, 1 + \epsilon\right) A_t \right) \right]$$
    where $r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$ and $A_t$ denotes the advantage function.
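The short sketch below evaluates the clipped surrogate objective above on a hand-picked batch of probability ratios and advantage estimates; the numbers are arbitrary and the sketch covers only the loss computation, not advantage estimation or the surrounding PPO training loop.

```python
import numpy as np

EPS = 0.2                                        # clipping parameter epsilon


def ppo_clip_objective(ratios, advantages, eps=EPS):
    """Clipped surrogate: mean over the batch of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))


# Hypothetical batch of probability ratios pi_theta / pi_theta_old and advantage estimates
ratios = np.array([0.8, 1.0, 1.3, 1.6])
advantages = np.array([1.0, -0.5, 2.0, 1.0])
print(ppo_clip_objective(ratios, advantages))    # large ratios with positive A are clipped at 1 + eps
```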

3.3.3. Actor-Critic Algorithms

Actor-critic methods combine the strengths of both value-based and policy-based approaches. They consist of two components: the actor, which selects actions, and the critic, which evaluates the actions by estimating a value function. This hybrid design improves training stability and learning efficiency. Such algorithms are frequently used in complex building systems, coordinating multiple RES components to optimize ESS, consumption, and grid interactions [40]. Common actor-critic algorithms found in the literature considering energy management concern the following approaches: Deep Deterministic Policy Gradient (DDPG), Soft Actor-Critic (SAC), and Advantage Actor-Critic (A2C). Their mathematical conceptualization may be expressed as follows:
  • Deep Deterministic Policy Gradient: DDPG is an off-policy actor-critic algorithm that combines the strengths of policy gradient methods and QL. It is well suited for continuous action spaces, making it advantageous for applications such as battery energy storage control, where actions like charge/discharge rates are continuous [81]. DDPG’s major advantage is its ability to handle high-dimensional, continuous control tasks. However, it is sensitive to hyperparameters and prone to instability during training. The policy (actor) updates based on gradients derived from the critic’s Q-function:
    $$\text{Critic update:}\quad Q(s_t, a_t) \leftarrow R(s_t, a_t) + \gamma\, Q\left(s_{t+1}, \mu(s_{t+1})\right)$$
    $$\text{Actor update:}\quad \nabla_{\theta} J(\theta) = \mathbb{E}\left[ \nabla_{a} Q(s_t, a)\big|_{a = \mu(s_t; \theta)}\, \nabla_{\theta} \mu(s_t; \theta) \right]$$
  • Soft Actor-Critic (SAC): SAC maximizes entropy, encouraging exploration and preventing premature convergence to suboptimal policies. Such algorithms have proven effective in RES control for smart grids and microgrids, where the environment is highly dynamic and stochastic [82]. SAC’s primary benefit lies in its robustness to environmental noise and its ability to achieve high performance across varying conditions. However, it requires substantial computational resources. SAC achieves sample efficiency and is well suited for real-time energy management applications but requires careful balancing of the entropy coefficient $\alpha$. Its entropy-regularized objective is:
    $$J(\pi) = \sum_{t} \mathbb{E}\left[ R(s_t, a_t) + \alpha\, \mathcal{H}\left(\pi(\cdot \mid s_t)\right) \right]$$
  • Advantage Actor-Critic: A2C is a synchronous, deterministic version of the asynchronous actor-critic framework. It uses both a policy network (actor) to choose actions and a value network (critic) to evaluate them, optimizing both simultaneously to improve learning efficiency. In the context of RES control for buildings, A2C is particularly useful for tasks that involve sequential decision-making under uncertainty, such as demand-side energy management and dynamic load scheduling [83]. A2C’s key advantage lies in training stability compared to purely policy-based or value-based methods. A2C leverages the advantage function to reduce variance in policy gradient estimation, which accelerates convergence. However, its performance may be sensitive to hyperparameter tuning and might struggle with very high-dimensional action spaces without additional enhancements.
    The advantage function $A(s_t, a_t)$ is calculated as the difference between the value of taking an action in a given state and the baseline value of the state:
    $$A(s_t, a_t) = R(s_t, a_t) - V(s_t)$$
    The policy gradient is updated to maximize the expected reward with reduced variance:
    $$\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, A(s_t, a_t) \right]$$
    A2C is well suited for moderately complex RES applications but can require enhancements like parallel training for scaling to larger and more complex environments (a minimal sketch of these update quantities is provided after this list).
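To illustrate the A2C quantities above, the following sketch performs a single actor and critic update for one transition, using a linear softmax policy, a linear value function, and a one-step advantage estimate; all dimensions, parameters, and the sampled transition are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
STATE_DIM, N_ACTIONS = 5, 3
GAMMA, LR = 0.95, 0.01
theta = rng.normal(scale=0.1, size=(STATE_DIM, N_ACTIONS))    # actor (policy) parameters
v_w = rng.normal(scale=0.1, size=STATE_DIM)                   # linear critic parameters


def policy(state):
    """Softmax policy with a linear parameterization."""
    logits = state @ theta
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()


# One observed transition (state, action, reward, next state); values are illustrative
s, s_next = rng.normal(size=STATE_DIM), rng.normal(size=STATE_DIM)
probs = policy(s)
a = rng.choice(N_ACTIONS, p=probs)
r = -0.4                                                      # e.g., negative energy cost

# One-step advantage estimate: A = r + gamma * V(s') - V(s)
advantage = r + GAMMA * (s_next @ v_w) - (s @ v_w)

# grad_theta log pi(a|s) for a linear softmax policy: outer(s, onehot(a) - pi(.|s))
grad_log_pi = np.outer(s, np.eye(N_ACTIONS)[a] - probs)
theta += LR * advantage * grad_log_pi                         # actor update (policy gradient step)
v_w += LR * advantage * s                                     # critic update toward the TD target
```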
In general, RL presents a transformative approach to controlling RES in buildings by adapting to the stochastic nature of renewable sources and optimizing energy consumption patterns dynamically. The choice of an RL algorithm depends on the specific application, with QL and DQN being suitable for discrete environments, while DDPG, PPO, and SAC excel in continuous, large-scale RES management. Future advancements in RL will further enable autonomous, resilient, and efficient energy systems in smart buildings.

4. Tables and Summaries of RL Applications

This section provides a structured analysis of RL applications in RES-integrated BEMS from 2015 to 2025. The Summarized Tables subsection offers a high-level overview, highlighting key attributes of each study, while the Summaries of Applications subsection provides a detailed description of the research scope and outcomes. This approach allows readers to quickly identify relevant applications in the tables and refer to the detailed summaries for deeper insights into their methodologies and findings.

4.1. Summarized Tables

Before proceeding to the actual summaries on the different applications, this section presents five summarized tables categorizing the studies based on the type of RL employed (namely, value-based, policy-based, actor-critic, hybrid, and other applications—see Table 1, Table 2, Table 3, Table 4 and Table 5). Such tables systematically classify each application according to key characteristics (e.g., reference, year of publication, RL method, agent type, BEMS combination, building type (residential/commercial), application type (simulation/real-life), and number of citations), ensuring a comprehensive understanding of the methodologies used in the field. More specifically, the Summarized Table columns identify the following:
  • Ref.: illustrating the reference application in the first column;
  • Year: illustrating the publication year for each research application;
  • Method: illustrating the specific RL algorithmic methodology applied in each work;
  • Agent: illustrating the agent type of the concerned methodology (single-agent or multi-agent RL approach);
  • BEMS: illustrating the combination of energy systems within the BEMS and the RES technology integrated in the energy mix (e.g., PV, ESS, TSS, SHW, WT, HVAC, HP, GHP, BIO, etc.);
  • Residential: defining if the testbed application concerns a residential building control application with an “x”;
  • Commercial: defining if the testbed application concerns a commercial building control application with an “x”;
  • Simulation: defining if the testbed application concerns a simulative building control application with an “x”;
  • Real-life: defining if the testbed application concerns a real-life building control application with an “x”;
  • Citations: portrays the number of citations—according to Scopus—of each work.
The symbol “-” denotes the “not identified” elements in the tables and figures.

4.2. Summaries of Applications

A more in-depth analysis of the aforementioned applications is illustrated in Table 6, Table 7, Table 8, Table 9 and Table 10 in the same order:

5. Evaluation

In order to provide a well-structured, comparative overview of existing RL solutions for RES-integrated BEMS, the evaluation centers on seven key attributes: (1) RL Type and Methodology, (2) Model Type, (3) Agent Type, (4) Target Output Type, (5) Baseline Control Type, (6) BEMS Combination Type, and (7) Building Type. Such facets were selected since they capture the most critical dimensions in designing, deploying, and benchmarking RL-based controllers. By dissecting the literature according to these attributes, the interested reader can gain a focused perspective on how RL methods align with different building environments, performance priorities, and design constraints—ultimately guiding more informed decisions about which RL approaches are best suited to specific BEMS needs.

5.1. RL Types and RL Methods

Value-based RL approaches have been the most widely studied in the literature on RES-integrated BEMS from 2015 to 2025 (see Figure 4). Among these, QL stands out as the most frequently adopted method due to its simplicity, flexibility, and suitability for discrete state and action spaces (see Figure 5). QL has proven particularly effective in applications requiring computational efficiency and adaptability to diverse RES components, such as PV systems and geothermal heat pumps (GHP) [84,98,99]. Its straightforward implementation, combined with its ability to optimize energy management within simulation environments, has made it the preferred choice for researchers [85,88,89,91,92,93,94,95].
The dominance of QL in the literature highlights the field’s inclination toward computationally inexpensive methods that may be deployed with minimal tuning. However, QL’s reliance on discrete action spaces limits the applicability of such a straightforward RL approach in real-world scenarios where continuous control variables, such as temperature regulation or battery charge levels, play a crucial role. While extensions like function approximation and discretization strategies have been explored, they introduce additional trade-offs in terms of stability and convergence.
DQN, an extension of QL, has gained significant traction in recent years due to advancements in computational capabilities. The integration of deep ANNs as function approximators has enabled DQN to handle high-dimensional state-action spaces, making it more effective in complex energy optimization tasks. Studies [90,96,97,98,100] demonstrate DQN’s ability to outperform traditional QL by improving convergence speed and decision-making accuracy in dynamic energy environments. However, challenges remain in terms of stability, sample efficiency, and generalization across diverse building types; DQN’s reliance on predefined reward structures poses challenges for real-world applications, contributing to a gradual shift toward alternative RL paradigms. Notably, literature findings indicate that only one value-based application incorporates real-life validation, utilizing the novel Fitted Q-Iteration (FQI) algorithm [86]. Such limited real-world validation of value-based methods underscores the difficulties in designing reward functions that generalize well across varying environmental conditions. The reliance on simulation environments often overlooks crucial aspects, such as sensor noise, hardware latency, and the stochastic behavior of occupants, which can significantly affect policy performance in deployment. This suggests that future research should explore methods to enhance transferability, such as domain adaptation or robust RL techniques.
Policy-based RL methods, such as Proximal Policy Optimization (PPO), exhibit a lower prevalence in RES-integrated BEMS research from 2015 to 2025 (see Figure 5). These approaches demand higher computational resources and require high-quality data for gradient-based optimization, making them less favorable in energy systems research, where modeling inaccuracies and data sparsity are common [103,104]. Their primary advantage lies in their ability to handle continuous action spaces and stochastic policies, making them well suited for complex control optimization. However, the dynamic and uncertain nature of building-integrated RES systems has hindered their widespread adoption, as their high computational cost and lack of robust, scalable frameworks present notable challenges. The lower adoption rate of policy-based methods likely stems from the difficulty in balancing exploration and exploitation in dynamic energy systems. Unlike value-based methods, which rely on well-defined value functions, policy-based approaches appear to be more sensitive to hyperparameter tuning and sample efficiency. Future research efforts should focus on improving sample efficiency and stability through techniques such as trust region optimization and adaptive learning rates, which could make these methods more practical for real-time energy management.
Actor-critic methods, particularly the Twin Delayed Deep Deterministic Policy Gradient (TD3) [107,112,118], DDPG [105,108,113,116], and Soft Actor-Critic (SAC) [110,111,114,115,117], are gaining increasing attention in the field of building-integrated RES control (see Figure 5). These methods bridge the gap between value-based and policy-based approaches, providing a balance between stability and efficiency in optimizing continuous and high-dimensional action spaces. Their rising adoption is driven by their robustness in handling real-world complexities, including nonlinear system dynamics, multi-agent interactions, and partially observable environments—common challenges in energy systems. TD3 and SAC, in particular, have demonstrated superior stability during training and more effective exploration strategies, enabling them to outperform earlier actor-critic variants in complex scenarios. The growing prominence of actor-critic methods reflects an effort within the research community to address the limitations of purely value-based or policy-based approaches, particularly in real-time applications and multi-objective optimization for RES integration. Inevitably, such increasing adoption suggests a shift toward more data-driven and adaptive control strategies in RES-integrated BEMS. Their ability to incorporate both policy optimization and value estimation offers a crucial advantage in dynamically evolving environments. However, one major challenge remains—their sensitivity to reward shaping and hyperparameter selection. Research in this area should explore automated tuning mechanisms, such as meta-RL or RL with evolutionary strategies, to enhance adaptability and reduce the need for manual intervention.
Hybrid RL applications represent an emerging and rapidly growing niche in the field of RES-integrated BEMS. Such approaches combine RL algorithms with complementary techniques, such as fuzzy logic control (FLC) [121,125,128], ANN [124,130], model predictive control (MPC) [127], as well as RBC [125,129], to leverage the strengths of multiple methodologies for improved performance. The most commonly observed combinations include RL with FLC, RBC, and ANNs, each offering distinct advantages for specific energy management challenges. More specifically, integrating QL with FLC provides an intuitive approach to handling uncertainties and nonlinearities in RES systems, making it particularly effective in residential applications where energy demand and supply fluctuations are significant [121]. Similarly, coupling RL and DRL methods with ANNs enables the development of adaptive controllers capable of learning from large, complex datasets while accounting for dynamic system behaviors [124,130]. Additionally, hybrid frameworks such as TD3/MPC offer the benefits of predictive model-based control while utilizing the policy optimization strengths of actor-critic methods [123]. This makes them highly suitable for multi-objective optimization in both residential and commercial settings, further solidifying their potential as an innovative research direction in RL-based energy management.
The rise of hybrid RL approaches signifies a recognition of the limitations inherent in standalone RL models when applied to real-world RES-integrated BEMS. The incorporation of auxiliary control methods mitigates stability and generalization issues, yet it also introduces additional complexity in model integration and interpretability. Future research should explore standardized frameworks for hybrid RL implementation, enabling seamless adaptation across various energy systems without an excessive tuning and computational burden.
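As one minimal sketch of such a hybrid pattern, the fragment below couples a tabular Q-learning agent with a rule-based safety layer that overrides battery actions violating state-of-charge limits; the state encoding, thresholds, and reward terms are hypothetical placeholders used purely to illustrate the RL-plus-RBC idea.

import numpy as np

N_STATES, N_ACTIONS = 24, 3            # hour of day; 0 = idle, 1 = charge, 2 = discharge
Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma, eps = 0.1, 0.95, 0.1

def rule_based_override(action, soc, soc_min=0.15, soc_max=0.95):
    """Safety layer: block charging above soc_max and discharging below soc_min."""
    if action == 1 and soc >= soc_max:
        return 0
    if action == 2 and soc <= soc_min:
        return 0
    return action

def step(state, soc):
    # epsilon-greedy proposal from the Q-table
    a = np.random.randint(N_ACTIONS) if np.random.rand() < eps else int(Q[state].argmax())
    a = rule_based_override(a, soc)    # hybrid part: the RBC layer filters the RL action
    next_state = (state + 1) % N_STATES
    reward = {0: 0.0, 1: -1.0, 2: 0.5}[a]          # placeholder reward signal
    Q[state, a] += alpha * (reward + gamma * Q[next_state].max() - Q[state, a])
    return next_state

# In a real deployment, soc would be read from the battery model or plant at each step.
state = 0
for _ in range(1000):
    state = step(state, soc=0.5)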

5.2. Model Types

Unlike model-free RL, which learns optimal policies through direct interaction and trial-and-error without relying on a predefined system model, model-based RL integrates a predictive model of the environment. This allows the agent to simulate future outcomes, enabling more efficient decision-making and planning. However, despite the growing adoption of RL, only a small subset of studies explore model-based RL approaches [86,103,120,122,133,134], drawn by their higher data efficiency and predictive capabilities (see Figure 6). The wider implementation of model-based RL for BEMS control remains constrained by the complexity of accurately modeling building dynamics and by the associated computational overhead.
However, model-based RL may prove to be particularly advantageous in energy systems with predictable physical dynamics, such as HVAC thermal responses or battery charge/discharge cycles. It should be noted though that in real-world BEMS, where occupant behavior and external weather conditions introduce significant uncertainty, even minor modeling errors may lead to suboptimal control strategies. In order to enhance the applications of model-based RL, future work could further explore hybrid frameworks that fuse data-driven adaptation with physics-based models, offering a balance between computational tractability and robustness to real-world variability. Recent studies have begun addressing such challenges by integrating hybrid approaches, combining model-based planning with deep learning techniques to improve scalability and real-world applicability.
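The following sketch illustrates the model-based pattern in its simplest form: a one-step linear thermal model is fitted from logged data and then used to score candidate heating sequences via simulated rollouts (random shooting). All coefficients and signals are synthetic placeholders, not values drawn from the reviewed studies.

import numpy as np

def fit_linear_model(T, u):
    """Fit a one-step thermal model T[t+1] ~ a*T[t] + b*u[t] + c by least squares."""
    X = np.column_stack([T[:-1], u, np.ones(len(u))])
    coef, *_ = np.linalg.lstsq(X, T[1:], rcond=None)
    return coef                                            # (a, b, c)

def plan_random_shooting(coef, T0, horizon=12, n_candidates=200, setpoint=21.0):
    """Pick the heating sequence whose simulated rollout tracks the setpoint best."""
    a, b, c = coef
    best_seq, best_cost = None, np.inf
    for _ in range(n_candidates):
        seq = np.random.uniform(0.0, 1.0, size=horizon)    # candidate heating powers
        T, cost = T0, 0.0
        for u in seq:
            T = a * T + b * u + c                          # roll the learned model forward
            cost += (T - setpoint) ** 2 + 0.1 * u          # comfort tracking + energy use
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq[0]                                     # apply only the first action

# Synthetic "logged" data used only to demonstrate the fit-then-plan loop
rng = np.random.default_rng(1)
u_log = rng.uniform(0.0, 1.0, 500)
T_log = np.empty(501); T_log[0] = 20.0
for t in range(500):
    T_log[t + 1] = 0.9 * T_log[t] + 2.0 * u_log[t] + 1.0 + rng.normal(0.0, 0.05)
print(plan_random_shooting(fit_linear_model(T_log, u_log), T0=19.0))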

5.3. Agent Type

The high prevalence of multi-agent RL in RES-integrated energy systems is evident, with approximately 40% of the evaluated studies employing MARL-based control architectures (see Figure 7). This trend reflects the increasing complexity of modern building energy frameworks, where multiple interconnected energy systems must be managed simultaneously. Similar to other engineering domains that require distributed optimization [135,136], the rise of smart buildings and IoT-driven automation has further accelerated the need for adaptive, self-learning control solutions [137].
As buildings increasingly integrate PV systems, battery storage, HVAC units, and smart loads, optimizing their operation necessitates decentralized or distributed decision-making to account for diverse constraints and dynamic interactions. Each of these subsystems presents distinct operational characteristics, making MARL a compelling solution for energy management. In a MARL framework, decentralized agents—each representing a specific energy component—independently learn and adapt to their local environment while collectively working toward system-wide efficiency objectives, such as cost minimization and energy optimization [97,98,102,116,133]. Moreover, MARL is particularly well suited for handling multi-objective trade-offs, such as balancing thermal comfort with demand-side management or maximizing self-consumption while minimizing grid reliance [101,110,114]. The increasing availability of computational resources and RL frameworks has further facilitated its adoption, solidifying MARL as a fundamental methodology for managing complex, integrated energy systems. Another key advantage of MARL is its robustness in handling uncertainties associated with RES integration. Given the inherent variability of PV and solar thermal output due to weather conditions, MARL enables adaptive control strategies that optimize real-time energy allocation [118,123]. Additionally, it enhances flexibility by dynamically adjusting energy use based on real-time occupancy data, electricity pricing, or predictive demand models [100,117]. These advantages position MARL as a crucial approach for the future of intelligent, decentralized energy management in buildings.
According to the evaluation, value-based approaches, most notably MARL QL [84,85,91,121,130] and DQN [90,97,123], represent the most commonly applied methods (Figure 8). The higher occurrence of value-based methods, particularly QL, may be attributed to their straightforward implementation, off-policy nature, and ability to handle discrete action spaces effectively. QL-based MARL frameworks, including DQN, are widely utilized due to their sample efficiency and ease of adaptation to decentralized learning environments. Such methods facilitate independent learning, making them suitable for multi-agent systems where communication constraints and non-stationarity are challenges.
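A minimal sketch of this independent-learner pattern is given below: each subsystem agent maintains its own Q-table and updates it from a shared reward signal, with all state and action sizes chosen purely for illustration rather than taken from the cited studies.

import numpy as np

class IndependentQAgent:
    """One tabular Q-learning agent per subsystem (e.g., battery, HVAC)."""
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, eps=0.1):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def act(self, s):
        if np.random.rand() < self.eps:
            return np.random.randint(self.Q.shape[1])     # explore
        return int(self.Q[s].argmax())                    # exploit

    def update(self, s, a, r, s_next):
        td_target = r + self.gamma * self.Q[s_next].max()
        self.Q[s, a] += self.alpha * (td_target - self.Q[s, a])

# Two decentralized agents learning independently from the same global reward
battery_agent = IndependentQAgent(n_states=24, n_actions=3)
hvac_agent = IndependentQAgent(n_states=24, n_actions=5)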
Actor-critic MARL algorithms are also highly prevalent, with SAC [114,115,117] and MCTS [122,133,134] as the main representatives. SAC, in particular, is emerging as a dominant choice due to its entropy-regularized objective, which enhances exploration and stability in continuous action spaces. SAC's ability to balance exploration and exploitation effectively makes it particularly useful in MARL scenarios with complex, high-dimensional control tasks. Less frequently applied MARL methodologies include DDPG [116], PPO [131], and A2C [106], likely due to their sensitivity to hyperparameters and exploration inefficiencies in multi-agent environments.
In sum, while multi-agent RL demonstrates clear advantages for complex, decentralized BEMS optimization, it also introduces challenges, such as inter-agent coordination and potential non-stationarity in multi-agent environments. In practice, heterogeneous agents may learn conflicting objectives unless carefully designed reward structures or consensus mechanisms are in place. Future research could focus on hierarchical agent architectures or communication protocols that balance local autonomy with global coordination, ensuring scalable and robust energy management in increasingly interconnected building ecosystems.

5.4. Value and Action Network Types

The fundamental distinction in MARL approaches arises from the structure of value networks (critics) and action networks (policies), where different methodologies have been employed to balance decentralized decision-making and global coordination. In this subsection, we explore some of the prevailing trends in MARL applications for BEMS by examining the structural and training paradigms of value and action networks, as well as the implications for system-wide energy optimization.
  • Value Networks in MARL Applications: Value networks in MARL serve as the critic, helping agents estimate the expected return for state-action pairs. Two dominant approaches emerge in the existing research: fully decentralized critics, where each agent maintains its own independent Q-network, and centralized critics, which leverage global information to enhance coordination. Studies such as [84,90,91] demonstrate that fully separate critics are advantageous when agents operate independently with minimal reliance on global coordination. Such approaches enable robust, distributed learning but often suffer from non-stationarity, as agents adjust their strategies independently, causing instability in learning dynamics. In contrast, centralized critics, as employed in [85,102,109,116,130], integrate joint state-action evaluations to improve cooperation between agents. The ability to assess system-wide performance leads to more stable and coordinated energy management strategies. However, such practice may introduce scalability challenges, as sharing global state information may be computationally intensive and require significant communication overhead. The trade-off between decentralized and centralized critics is thus driven by the scale of deployment and the necessity for real-time agent collaboration.
  • Action Networks in MARL Applications: Action networks—or policy networks—determine how agents select actions. The choice between independent and coordinated policy networks is crucial in defining MARL strategies for BEMS. In fully decentralized settings, as seen in [84,90,91,97,115], each agent learns and optimizes its policy independently, often through deep QL or actor-critic models. This method ensures resilience and local autonomy, allowing each agent to optimize its own objectives. However, the lack of explicit cooperation mechanisms may lead to conflicts in decision-making, where independent policies may not align with overall energy efficiency goals. On the other hand, coordinated policy networks found in [85,102,116,131], introduce mechanisms to balance local autonomy with system-wide optimization. Such approaches often adopt a centralized training, decentralized execution (CTDE) paradigm, allowing agents to learn with shared knowledge while maintaining independent execution capabilities. The integration of federated RL [109] or shared reward functions [131] further enhances cooperative learning by aligning incentives among distributed agents without imposing direct communication constraints.
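The centralized training, decentralized execution idea can be summarized compactly: during training, a single critic scores the joint observation-action pair, while each actor conditions only on its local observation and is all that is needed at execution time. The sketch below uses small NumPy linear maps as stand-ins for the neural networks employed in the cited works; all dimensions are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim, n_agents = 4, 1, 2

# Decentralized actors: each maps only its local observation to its local action
actor_w = [rng.normal(size=(act_dim, obs_dim)) for _ in range(n_agents)]
def act(i, obs_i):
    return np.tanh(actor_w[i] @ obs_i)

# Centralized critic (training only): scores the joint observation-action vector
critic_w = rng.normal(size=n_agents * (obs_dim + act_dim))
def centralized_q(all_obs, all_acts):
    joint = np.concatenate(all_obs + all_acts)
    return float(critic_w @ joint)

obs = [rng.normal(size=obs_dim) for _ in range(n_agents)]
acts = [act(i, obs[i]) for i in range(n_agents)]
print(centralized_q(obs, acts))   # at execution time only act() is required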
Figure 9 portrays the occurrence of the different structure, training, and coordination approaches across the multi-agent research works:
  • Figure 9-Left shows the occurrence of structure subtypes (one centralized critic with separate policy networks; fully separate critics and policies for each agent; shared critic and policy across all agents; hybrid types).
  • Figure 9-Center shows the occurrence of training subtypes (centralized training, decentralized execution (CTDE); fully decentralized training; fully centralized training; hybrid training).
  • Figure 9-Right shows the occurrence of coordination subtypes (implicit coordination; explicit coordination; emergent coordination; predefined roles or hierarchical coordination).
According to the measurements, most MARL structures either use a centralized critic with separate policies (7 occurrences) or fully independent critics/policies (11 occurrences). Hybrid approaches are rarer, and no study reported here fully shares a single critic and a single policy across all agents (see Figure 9-Left). With regard to training, fully decentralized training and CTDE represent the most common approaches, reflecting a desire to balance local autonomy with some centralized coordination. Mixed/hybrid training is slightly less frequent, while purely centralized training is relatively uncommon. Last but not least, with regard to coordination, explicit and emergent coordination together cover most cases, though hierarchical/predefined-role approaches also appear frequently. Only a small subset relies purely on implicit coordination (e.g., a shared global reward or shared critic) without any direct information exchange.

5.5. Reward Function Type

In RL, the reward function represents the core mechanism that defines the agent’s learning objective. By providing feedback on actions taken, it guides the agent toward optimal energy management strategies. To this end, a well-designed reward function is crucial for ensuring convergence, stability, and adaptability in RL-based control systems, particularly in building energy management, microgrid optimization, and demand-response applications.
Across the studies reviewed in this paper, a wide variety of reward function structures have been proposed, reflecting the complexity and multi-dimensional nature of integrated energy applications. The reviewed RL applications for RES-integrated BEMS exhibit a variety of reward functions tailored to specific operational objectives, which can be broadly categorized into several key architectures. The weighted sum of multiple objectives is the most prevalent, employed by numerous authors, such as [92,94,96,97,98,99,102,105,106,107,108,111,112,113,114,117,118,119,123,125,126,131,133,134]. These studies typically integrate diverse factors, such as energy costs, peak demand, occupant comfort, carbon emissions, renewable energy utilization, and storage management, into a scalar reward through tunable weights reflecting relative priorities. This practice is highly flexible and allows competing objectives to be balanced, but it requires careful tuning of the weight parameters, which may itself become a complicated optimization problem. Another common formulation found in the literature is the penalty-based reward, seen in the works of [84,88,93,95,103,106,120]. This type explicitly applies penalties for violating operational constraints, occupant discomfort, or suboptimal battery and appliance operations. Such an approach provides clear enforcement of constraints, making it useful for safety and reliability, but harsh penalties may lead to unstable learning and overly conservative policies. Only one study employed normalized or relative-improvement rewards, quantifying performance explicitly against baselines or rule-based strategies [129]. This category is relatively rare, yet it offers an effective means of benchmarking against predefined baselines; however, while it ensures meaningful progress tracking, it may struggle when baselines are not well defined or adaptive. Furthermore, sparse reward functions based on aggregated daily or episodic results were adopted by [86,90,101,115,122], typically to guide agents toward long-term optimal behaviors by focusing on overall outcomes rather than immediate feedback; such a strategy commonly requires advanced exploration techniques and slows the learning process. A notably sophisticated approach is the hierarchical or multi-agent reward structure identified in [102,116,117,121], aiming at coordination among multiple agents or subsystems for holistic optimization. A total of four studies employed this methodology, facilitating coordinated decision-making in complex systems. While such schemes are particularly useful in complicated BEMS, they increase computational and, commonly, communication complexity, and may require additional mechanisms to prevent conflicting objectives among agents.
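As a concrete illustration of the dominant weighted-sum architecture (with a penalty term folded in), the snippet below composes a scalar reward from hypothetical cost, comfort, and PV self-consumption signals; the weights, comfort band, and units are arbitrary placeholders rather than values reported by any particular study.

def weighted_sum_reward(energy_cost, temp, pv_self_consumption,
                        w_cost=1.0, w_comfort=2.0, w_pv=0.5,
                        comfort_band=(20.0, 24.0)):
    """Scalar reward: negative cost, comfort-band penalty, PV self-consumption bonus."""
    low, high = comfort_band
    comfort_violation = max(0.0, low - temp) + max(0.0, temp - high)  # degrees outside band
    return (-w_cost * energy_cost
            - w_comfort * comfort_violation
            + w_pv * pv_self_consumption)

# Example: 0.8 EUR spent this step, 25.1 degC indoor temperature, 1.2 kWh PV used on-site
print(weighted_sum_reward(0.8, 25.1, 1.2))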
Figure 10 illustrates the occurrence of the different types of reward functions for the integrated research applications. By analyzing these reward functions, it is evident that the majority of the integrated studies prefer to employ a weighted sum architecture due to its flexibility in adjusting the relative importance among multiple competing objectives, like cost efficiency, comfort, renewable energy maximization, and system stability. Moreover, occupant comfort consistently emerges as a critical factor, typically managed by explicit penalty terms that strongly discourage comfort deviations, thus reinforcing user satisfaction as a central theme [84,88,93,95,103,106,120]. Also, integration of renewable energy self-consumption and peak load reduction consistently appear, reflecting increasing grid responsiveness and sustainability objectives [98,111,117]. Lastly, there is a notable increase in employing advanced RL architectures, such as multi-agent and hierarchical methods [102,116,117,121], reflecting a trend towards more scalable, distributed, and cooperative control frameworks in future BEMS research.

5.6. Target Output

RL applications for RES primarily focus on optimizing performance, efficiency, and adaptability. A key objective is ensuring occupant comfort satisfaction while balancing energy consumption, as demonstrated in studies that assess comfort satisfaction [91,93,98,99,119,126,132,133]. Additionally, energy savings and cost reduction serve as major evaluation metrics, assessing the ability of RL frameworks to minimize energy usage and operational expenses. As shown in Figure 11, these metrics appear in over 27 studies as primary justifications for RL-based optimization.
Another critical performance metric concerns RES energy capture, particularly PV self-sufficiency and self-consumption, which evaluates how effectively RES-generated energy is utilized on-site, minimizing reliance on external grid power. This metric is observed in nine studies [84,86,92,94,98,100,119,129,131]. Grid reliance represents another essential factor, encompassing load shifting, demand response, peak demand reduction, power balancing, and operational flexibility. Such metrics measure the ability of RL-based energy management strategies to optimize interactions with the electricity grid. Figure 11 indicates that this metric is prevalent in over 19 cases [89,90,94,95,97,100,101,102,109,110,111,113,114,115,117,118,121,123,127,128,129,134]. Environmental footprint represents another significant evaluation criterion, quantifying emissions reductions and overall sustainability improvements. However, relatively few studies explicitly incorporate this factor into their evaluation [116,133] (see Figure 11).
Additional key considerations include faster convergence, training time, computational efficiency, scalability, and data efficiency, which reflect the RL model’s ability to learn optimal strategies quickly and effectively. Computational time is particularly relevant for real-world feasibility, assessing whether an RL-based control approach can operate efficiently in practical deployments. These factors are extensively evaluated in studies focused on model performance (model-related metrics) [89,95,106,111,115]. Other relevant considerations include system resilience, adaptability, and compliance with regulatory frameworks [112,121,122].
Despite the clear emphasis on occupant comfort and energy savings, the practical implementation of RL-based BEMS faces challenges arising from the trade-offs among comfort, cost, and grid reliance. Improvements in one metric can come at the expense of others; for instance, aggressively minimizing costs might compromise thermal comfort or require higher grid dependency. Furthermore, the heterogeneous nature of metrics (e.g., environmental footprint vs. energy capture) often necessitates complex multi-objective reward structures, which can exacerbate difficulties in policy convergence. Future work could investigate adaptive weighting schemes or hierarchical RL approaches that dynamically balance these competing performance indicators in real time.

5.7. Baseline Control

Establishing baseline control approaches is crucial for benchmarking the effectiveness of RL-based energy management strategies. RBC represents the most commonly used baseline approach, accounting for more than 50% of all evaluated cases (see Figure 12, Right). Although RBC is widely adopted, its simplistic nature makes it a limited benchmark for advanced RL techniques. Alternative RL approaches have also been used as baselines in nine studies [85,102,105,106,107,109,112,117,126], with DDPG emerging as the most frequently applied RL-based baseline [105,107,112,117,118,126] (see Figure 12, Left).
Optimization-based methods, including Mixed-Integer Linear Programming (MILP) [93,109,128] and MPC [95,119], have also been explored as baselines. Although less prevalent than RBC, these techniques provide structured decision-making models that contrast with RL’s adaptive nature. Evolutionary algorithms, such as Genetic Algorithms (GAs) and Particle Swarm Optimization (PSO), have been employed in specific cases [104,128], while classical control strategies, including Proportional-Integral (PI) controllers, have been evaluated in hybrid or low-complexity implementations [103,127,131] (see Figure 12, Left).
Other alternative baseline approaches have a more limited presence in the literature. For instance, MCTS [134] and peer-to-peer trading models [116] remain relatively underexplored in RL-based energy management. The limited number of studies incorporating these alternative decision-making paradigms suggests a promising direction for future research, particularly in enhancing RL-based optimization frameworks.
While RBC is generally favored for its simplicity and ease of implementation, its limitations mean that it reveals only a fraction of RL's potential benefits. Robust optimization-based methods, such as MPC, may offer more stringent benchmarks against which RL algorithms can demonstrate adaptive decision-making under dynamic conditions. However, the relatively sparse use of advanced optimization techniques or novel approaches like MCTS indicates a gap in examining how RL truly stacks up against best-in-class baselines. Exploring these underutilized baselines could provide deeper insights into RL's capacity for handling complex multi-objective and real-time control scenarios.

5.8. RES Type

The dominance of PV integration over other RES types, such as SWH, WT, BIO, and GHP, in the energy mix is driven by both technological and economic factors (see Figure 13-Left). PV systems offer a cost-effective, scalable, and easily integrated solution for buildings, whereas other RES technologies often require more extensive infrastructure, higher maintenance, and complex operational strategies. Additionally, advancements in battery storage and smart inverters have further enhanced the viability of PV for self-consumption and grid interaction, reinforcing its widespread adoption. This trend suggests that future RL research in building energy management will likely focus on advanced PV-driven optimization strategies, including real-time demand response, predictive control for ESS, and PV-based multi-agent coordination.
The focus on Hybrid-RES (Multi-RES) configurations (See Figure 13-Right) is closely linked to the adoption of MARL approaches, as managing multiple energy sources requires distributed control strategies capable of optimizing energy flow dynamically [84,114,123,133,134]. As smart buildings and microgrids grow in complexity, future RL research is expected to emphasize hybrid RES frameworks, developing adaptive, scalable, and cooperative RL models that efficiently integrate heterogeneous energy sources, storage solutions, and flexible loads. While PV systems understandably dominate RL-based BEMS research due to their cost-effectiveness and straightforward deployment, this narrow focus risks overlooking the synergies and resilience gains offered by hybrid-RES configurations. As building energy ecosystems become more intricate, future RL solutions must embrace multiple energy sources, exploiting their complementary generation profiles to enhance adaptability, reduce grid reliance, and address the nuanced trade-offs between cost, efficiency, and occupant comfort in real-world environments.
Another key trend emerging from the evaluation of existing studies is the prevalence of single-RES integration in building energy systems. This is primarily due to the lower complexity and reduced implementation costs associated with optimizing a single energy source, particularly PV. However, there is a growing interest in multi-RES (or hybrid-RES) systems, driven by the need for greater energy flexibility, resilience, and efficiency. Combining PV with other RES technologies, such as SWH, WT, BIO, or GHP, enhances self-sufficiency by leveraging complementary generation profiles, mitigating the variability of any single energy source.

5.9. BEMS Type

The analysis of existing studies highlights that PV and ESS are the most frequently combined energy components, emphasizing their role in enhancing self-consumption, grid independence, and energy flexibility [85,87,90,100,122,129,130] (see Figure 14). The integration of HVAC systems [102,105,106,108,109,116], as well as other smart appliances [93,95,121,124], further underlines the importance of demand-side optimization, given that heating and cooling systems contribute significantly to total energy consumption.
Although EVs have been studied less frequently in the literature, they present new opportunities for bidirectional energy flow through vehicle-to-grid (V2G) and vehicle-to-home (V2H) strategies. These strategies enhance grid flexibility but also introduce challenges related to stochastic availability and battery degradation [88,119,128] (see Figure 14). Lastly, the integration of WT and GHP, as observed in studies such as [133,134], suggests a growing interest in hybrid renewable energy systems. These complex multi-energy frameworks require RL-based optimization to effectively manage energy distribution and improve system efficiency within microgrid environments.
While PV and ESS dominate current BEMS research, the interactions among diverse energy assets—HVAC systems, EVs, and additional smart loads—call for more holistic RL-based control strategies that capture their interdependencies. The inclusion of EVs, in particular, necessitates tackling challenges related to battery health and uncertain charging patterns, moving beyond purely static optimization. As building ecosystems diversify, future RL implementations must evolve to coordinate these disparate elements, striking a delicate balance between occupant comfort, operational complexity, and system-wide resilience.

5.10. Building Type

The evaluation of existing studies reveals a clear dominance of residential buildings over commercial buildings in RL applications for RES-integrated BEMS (see Figure 15, Right). Unlike model-free control applications for HVAC systems, which are predominantly implemented in commercial buildings [31], the integration of RES within the energy mix is primarily observed in residential structures.
This trend is influenced by several key factors. Residential energy management typically involves simpler operational dynamics, making RL deployment more feasible. Homeowners and small-scale users increasingly adopt PV systems and battery storage, benefiting from demand-side flexibility and self-consumption optimization—both of which RL can effectively manage. Additionally, the rise of smart home technologies and IoT-based automation is expected to further accelerate RL adoption in residential settings.
In contrast, RL applications in commercial buildings, such as offices [98,109,110,111,112,117,118,123,125,129], university campuses [89,99,103,108], hospitals [116], and restaurants [114,125] (see Figure 15, Left), remain relatively limited.
The higher complexity of energy demands in commercial buildings presents challenges for RL implementation. These environments typically involve HVAC systems, fluctuating occupancy patterns, and strict operational constraints, all of which necessitate more advanced control strategies. Additionally, commercial buildings often employ centralized energy management systems and rely on contract-based energy procurement, reducing the immediate need for RL-driven optimization. However, as RL research advances, applications in commercial settings are expected to expand, particularly in areas such as large-scale multi-zone control, grid-interactive buildings, and hybrid RES integration. In these cases, RL has the potential to optimize load shifting, demand response, and multi-agent coordination, significantly improving energy efficiency and operational resilience.

6. Discussion

The discussion section, divided into Current Trends and Future Directions, provides a structured reflection on key findings and their broader significance. The Current Trends subsection distills recurring themes and predominant approaches in RL applications for RES-integrated BEMS, offering insights into the present research focus. In Future Directions, open research challenges and unexplored areas are identified, equipping readers with strategic considerations for advancing both practical implementations and theoretical developments in the field.

6.1. Trends Identification

Dominance of value-based RL and the rise of actor-critic methods: The evaluation indicates that value-based RL, particularly QL, has been the most applied method due to its simplicity and effectiveness in handling discrete state-action spaces [84,88,89,91,92,93,94,99]. However, as RL applications become more complex, there is a noticeable shift toward actor-critic methods, such as TD3, DDPG, and SAC, which are better suited for high-dimensional, continuous control problems [105,107,113,115,117,118]. These methodologies integrate elements of building physics to enhance learning efficiency and stability while reducing convergence failures [86,103,111]. Despite their higher computational demands, actor-critic frameworks are emerging as a promising alternative for real-time energy optimization.
Hybrid RL for stability and reliability: A growing number of studies integrate RL with complementary control techniques, such as FLC, MPC, or heuristic RBC [104,121,127,128,131,134] (see Figure 4). This hybridization ensures that RL agents operate within safe and reliable constraints, preventing failures in real-world scenarios. Hybrid RL approaches seem particularly effective in energy systems with slow thermal responses, where traditional control handles short-term fluctuations while RL optimizes long-term performance.
Widespread adoption of multi-agent reinforcement learning (MARL) for complex systems: Nearly 40 percent of studies now employ MARL to manage complex energy ecosystems where multiple subsystems require coordinated control [97,98,100,102,116,133] (see Figure 7). This decentralized approach enhances adaptability to fluctuating renewable energy system generation and dynamic energy consumption patterns. However, MARL still faces challenges, such as increased communication overhead and coordination inefficiencies, necessitating advanced mechanisms for agent interaction and stability [110,117,123].
Multi-objective RL for balanced decision-making: RL applications increasingly incorporate multi-objective optimization to balance competing targets, such as energy cost, thermal comfort, carbon emissions, and peak demand reduction [93,101,107,110,117,118] (see Figure 12). Rather than optimizing a single reward function, recent studies introduce weighted trade-offs and explicit constraints to ensure practical feasibility. Some works embed predefined comfort or CO2 emission thresholds into the learning process, aligning RL policies with regulatory requirements.
Focus on federated, privacy-preserving, and real-time RL: An increasing number of studies explore federated learning in RL-based BEMS, where multiple buildings train local models and share aggregated updates instead of raw data [106,109,116]. This approach enhances data privacy and security while supporting distributed learning. Additionally, there is a shift toward real-time RL applications capable of responding to changing conditions within minutes, moving beyond day-ahead scheduling toward sub-15-minute decision cycles [103,111,131].
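A minimal sketch of the federated pattern is shown below: each building trains a local parameter vector, and only those parameters (plain NumPy arrays standing in for network weights) are averaged by a coordinator, so raw consumption data never leave the premises. Weighting the average by local dataset size follows the usual FedAvg recipe and is an assumption on our part, not a detail reported in the cited studies.

import numpy as np

def federated_average(local_weights, local_sample_counts):
    """FedAvg-style aggregation: average local models weighted by data volume."""
    total = float(sum(local_sample_counts))
    return sum(w * (n / total) for w, n in zip(local_weights, local_sample_counts))

# Three buildings each hold a locally trained weight vector and a sample count
building_models = [np.array([0.2, 1.1]), np.array([0.4, 0.9]), np.array([0.1, 1.3])]
samples = [1200, 800, 2000]
global_model = federated_average(building_models, samples)
print(global_model)   # broadcast back to all buildings for the next local round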

6.2. Future Directions

Advancing hybrid RL and algorithmic innovations: Future research should focus on more sophisticated hybrid RL approaches that blend model-based and model-free paradigms to enhance learning efficiency and decision robustness. Current studies highlight the trend of combining RL with predictive models, FLC, and MPC to mitigate convergence issues and accelerate policy adaptation [121,127,131,134]. Future advancements should ensure that RL agents leverage prior knowledge while maintaining adaptability to unforeseen conditions.
Scalable and hierarchical MARL: As BEMS extend beyond individual buildings to microgrids and districts, flat MARL architectures may become inefficient due to increased communication overhead and coordination challenges. Hierarchical RL structures, where high-level policies coordinate multiple lower-level subsystems, can enhance scalability and stability [97,100].
Expanding storage utilization beyond batteries: While battery storage is widely studied, other forms of ESS, such as thermal mass, phase-change materials, and distributed heating/cooling reserves, remain underexplored (see Figure 13, Left). Future research should focus on RL strategies that leverage thermal storage as a controllable resource, such as pre-cooling or intelligent heat storage management [98,133].
Accelerating RL adaptation across buildings and climates: RL models trained for specific buildings often struggle when transferred to different environments due to variations in climate, occupancy, and infrastructure. Future research should develop meta-learning and domain-adaptation techniques to enable RL agents to rapidly adjust to new settings [90,101].
Ensuring safety and compliance with constraints: For RL to be widely adopted in BEMS, safety assurances must be embedded directly into control policies. Techniques such as constrained RL, shielded RL, and Lyapunov-based critics will be essential to ensuring energy systems operate within predefined limits while preventing failures and regulatory non-compliance.
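The shielding idea can be captured in a few lines: before an RL action reaches the plant, a shield projects it onto the set of actions satisfying hard operational limits. The limits, sign convention, and parameter names below are illustrative assumptions only.

def shield(action_kw, soc, grid_limit_kw=5.0, soc_min=0.1, soc_max=0.95):
    """Project a proposed battery power (kW, + charge / - discharge) into the safe set."""
    safe = max(-grid_limit_kw, min(grid_limit_kw, action_kw))   # respect grid connection limit
    if soc >= soc_max:
        safe = min(safe, 0.0)       # full battery: forbid further charging
    if soc <= soc_min:
        safe = max(safe, 0.0)       # empty battery: forbid further discharging
    return safe

print(shield(7.5, soc=0.97))   # -> 0.0: clipped to the grid limit, then charging blocked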
Transition from simulations to real-world deployment: Despite significant progress in RL-driven energy management, most studies remain confined to simulation environments. Notably, only two works among the overall dataset implement real-life experiments considering RL in RES-integrated BEMS. Future research must prioritize real-world implementation with robust fallback mechanisms, ensuring safe operation under unexpected conditions [86].

7. Conclusions

This review involved a large-scale, systematic examination of RL control strategies for building energy management systems that integrate renewable energy system technologies from 2015 to 2025. Unlike previous surveys, this study categorized high-impact, peer-reviewed publications according to their RL method, building typology, and validation methodology, thereby offering an extensive perspective on the current state of research. Emphasis was placed on how decentralized agents coordinate in complex environments, how data-driven exploration can be blended with physically informed techniques, and how emerging hybrid approaches improve real-time adaptability and occupant comfort.
A key innovation of this review lies in its multi-level mapping of algorithmic structures, agent configurations, and operational objectives, highlighting where specific RL methods excel and where they face limitations. By linking these technical details to application outcomes, such as cost reduction, grid flexibility, and reduced carbon footprints, this study illuminates how RL can be practically deployed in increasingly complex building scenarios. Moreover, the compilation of MARL architectures and their coordination mechanisms provides actionable insights for practitioners aiming to design resilient, scalable energy systems, while also addressing occupant comfort and regulatory constraints.
Despite such advancements, a number of refinements could further accelerate RL-based BEMS toward large-scale practical use. Hybrid control schemes, for instance, can merge data-driven learning with model-based or heuristic strategies to shorten convergence times and reduce the likelihood of policy failures. Similarly, the development of enhanced exploration strategies for multi-agent systems would further mitigate non-stationarity and communication overhead, enabling robust coordination under dynamic conditions. Finally, domain adaptation and transfer learning approaches offer the promise of quickly repurposing trained policies across different building types and climates, minimizing the effort of retraining from scratch and fostering broader adoption.
Looking ahead, the next generation of RL-enabled BEMS would benefit from hierarchical MARL to manage control at multiple timescales, privacy-preserving architectures (e.g., federated RL) to facilitate data sharing without compromising user confidentiality, and constrained RL frameworks that embed essential safety and regulatory limits into the learning process. By addressing such research challenges, RL-driven energy management could transition from isolated demonstrations into wide-scale adoption, ultimately driving more sustainable, resilient, and occupant-centric building energy systems.

Funding

This research was partially funded by the SEED4AI project. The project is being implemented within the framework of the National Recovery and Resilience Plan “Greece 2.0”, with funding from the European Union—NextGenerationEU (Implementing Body: Hellenic Foundation for Research and Innovation (HFRI))/ID: 16880. SEED4AI: https://seed4ai.ee.duth.gr/ accessed on 26 March 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

A2C	Advantage Actor-Critic
ANN	Artificial Neural Network
BEMS	Building Energy Management System
BIO	Biomass Energy Systems
DDPG	Deep Deterministic Policy Gradient
DDQN	Double Deep Q-Network
D3QN	Dueling Double Deep Q-Network
DHW	Domestic Hot Water
DQN	Deep Q-Network
DR	Demand Response
DRL	Deep Reinforcement Learning
ESS	Energy Storage System
EVs	Electric Vehicles
FLC	Fuzzy Logic Control
GA	Genetic Algorithm
GHP	Ground Heat Pump
HEMS	Home Energy Management System
HP	Heat Pump
HVAC	Heating, Ventilation, and Air Conditioning
LS	Lighting System
LSTM	Long Short-Term Memory
MARL	Multi-Agent Reinforcement Learning
MCTS	Monte Carlo Tree Search
MDP	Markov Decision Process
MPC	Model Predictive Control
NZEB	Net-Zero Energy Building
PAR	Peak-to-Average Ratio
PV	Photovoltaic
QL	Q-Learning
RBC	Rule-Based Control
RES	Renewable Energy System
RL	Reinforcement Learning
SAC	Soft Actor-Critic
SWH	Solar Water Heating
TD3	Twin Delayed Deep Deterministic Policy Gradient
TSS	Thermal Storage System
V2G	Vehicle-to-Grid
V2H	Vehicle-to-Home
WT	Wind Turbine

References

  1. Economidou, M.; Todeschi, V.; Bertoldi, P.; D’Agostino, D.; Zangheri, P.; Castellazzi, L. Review of 50 years of EU energy efficiency policies for buildings. Energy Build. 2020, 225, 110322. [Google Scholar] [CrossRef]
  2. Pavel, T.; Polina, S.; Liubov, N. The research of the impact of energy efficiency on mitigating greenhouse gas emissions at the national level. Energy Convers. Manag. 2024, 314, 118671. [Google Scholar] [CrossRef]
  3. Ye, J.; Fanyang, Y.; Wang, J.; Meng, S.; Tang, D. A Literature Review of Green Building Policies: Perspectives from Bibliometric Analysis. Buildings 2024, 14, 2607. [Google Scholar] [CrossRef]
  4. Reddy, V.J.; Hariram, N.; Ghazali, M.F.; Kumarasamy, S. Pathway to sustainability: An overview of renewable energy integration in building systems. Sustainability 2024, 16, 638. [Google Scholar] [CrossRef]
  5. Rehmani, M.H.; Reisslein, M.; Rachedi, A.; Erol-Kantarci, M.; Radenkovic, M. Integrating renewable energy resources into the smart grid: Recent developments in information and communication technologies. IEEE Trans. Ind. Inform. 2018, 14, 2814–2825. [Google Scholar] [CrossRef]
  6. Harvey, L.D. Reducing energy use in the buildings sector: Measures, costs, and examples. Energy Effic. 2009, 2, 139–163. [Google Scholar] [CrossRef]
  7. Chel, A.; Kaushik, G. Renewable energy technologies for sustainable development of energy efficient building. Alex. Eng. J. 2018, 57, 655–669. [Google Scholar] [CrossRef]
  8. Farghali, M.; Osman, A.I.; Mohamed, I.M.; Chen, Z.; Chen, L.; Ihara, I.; Yap, P.S.; Rooney, D.W. Strategies to save energy in the context of the energy crisis: A review. Environ. Chem. Lett. 2023, 21, 2003–2039. [Google Scholar] [CrossRef]
  9. Yudelson, J. The Green Building Revolution; Island Press: Washington, DC, USA, 2010. [Google Scholar]
  10. Cao, X.; Dai, X.; Liu, J. Building energy-consumption status worldwide and the state-of-the-art technologies for zero-energy buildings during the past decade. Energy Build. 2016, 128, 198–213. [Google Scholar] [CrossRef]
  11. Gielen, D.; Boshell, F.; Saygin, D.; Bazilian, M.D.; Wagner, N.; Gorini, R. The role of renewable energy in the global energy transformation. Energy Strategy Rev. 2019, 24, 38–50. [Google Scholar] [CrossRef]
  12. Mogoș, R.I.; Petrescu, I.; Chiotan, R.A.; Crețu, R.C.; Troacă, V.A.; Mogoș, P.L. Greenhouse gas emissions and Green Deal in the European Union. Front. Environ. Sci. 2023, 11, 1141473. [Google Scholar]
  13. Famà, R. REPowerEU. In Research Handbook on Post-Pandemic EU Economic Governance and NGEU Law; Edward Elgar Publishing: London, UK, 2024; pp. 128–143. [Google Scholar]
  14. Jäger-Waldau, A.; Bodis, K.; Kougias, I.; Szabo, S. The New European Renewable Energy Directive-Opportunities and Challenges for Photovoltaics. In Proceedings of the 2019 IEEE 46th Photovoltaic Specialists Conference (PVSC), Chicago, IL, USA, 16–21 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 0592–0594. [Google Scholar]
  15. Fetsis, P. The LIFE Programme–Over 20 Years Improving Sustainability in the Built Environment in the EU. Procedia Environ. Sci. 2017, 38, 913–918. [Google Scholar]
  16. Hassan, Q.; Algburi, S.; Sameen, A.Z.; Salman, H.M.; Jaszczur, M. A review of hybrid renewable energy systems: Solar and wind-powered solutions: Challenges, opportunities, and policy implications. Results Eng. 2023, 20, 101621. [Google Scholar] [CrossRef]
  17. Kalogirou, S.A. Building integration of solar renewable energy systems towards zero or nearly zero energy buildings. Int. J. Low-Carbon Technol. 2015, 10, 379–385. [Google Scholar]
  18. Orikpete, O.F.; Ikemba, S.; Ewim, D.R.E. Integration of renewable energy technologies in smart building design for enhanced energy efficiency and self-sufficiency. J. Eng. Exact Sci. 2023, 9, 16423-01e. [Google Scholar] [CrossRef]
  19. Saloux, E.; Teyssedou, A.; Sorin, M. Analysis of photovoltaic (PV) and photovoltaic/thermal (PV/T) systems using the exergy method. Energy Build. 2013, 67, 275–285. [Google Scholar]
  20. Michailidis, P.; Michailidis, I.; Kosmatopoulos, E. Review and Evaluation of Multi-Agent Control Applications for Energy Management in Buildings. Energies 2024, 17, 4835. [Google Scholar] [CrossRef]
  21. Abdulraheem, A.; Lee, S.; Jung, I.Y. Dynamic Personalized Thermal Comfort Model: Integrating Temporal Dynamics and Environmental Variability with Individual Preferences. J. Build. Eng. 2025, 102, 111938. [Google Scholar]
  22. Lu, X.; Fu, Y.; O’Neill, Z. Benchmarking high performance HVAC Rule-Based controls with advanced intelligent Controllers: A case study in a Multi-Zone system in Modelica. Energy Build. 2023, 284, 112854. [Google Scholar]
  23. Drgoňa, J.; Picard, D.; Kvasnica, M.; Helsen, L. Approximate model predictive building control via machine learning. Appl. Energy 2018, 218, 199–216. [Google Scholar] [CrossRef]
  24. Blinn, A.; Kue, U.R.S.; Kennel, F. A Comparison of Model Predictive Control and Heuristics in Building Energy Management. In Proceedings of the 2024 8th International Conference on Smart Grid and Smart Cities (ICSGSC), Shanghai, China, 25–27 October 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 275–285. [Google Scholar]
  25. Chen, Z.; Xiao, F.; Guo, F.; Yan, J. Interpretable machine learning for building energy management: A state-of-the-art review. Adv. Appl. Energy 2023, 9, 100123. [Google Scholar] [CrossRef]
  26. Ukoba, K.; Olatunji, K.O.; Adeoye, E.; Jen, T.C.; Madyira, D.M. Optimizing renewable energy systems through artificial intelligence: Review and future prospects. Energy Environ. 2024, 35, 3833–3879. [Google Scholar] [CrossRef]
  27. Shobanke, M.; Bhatt, M.; Shittu, E. Advancements and future outlook of Artificial Intelligence in energy and climate change modeling. Adv. Appl. Energy 2025, 17, 100211. [Google Scholar] [CrossRef]
  28. Pergantis, E.N.; Priyadarshan; Al Theeb, N.; Dhillon, P.; Ore, J.P.; Ziviani, D.; Groll, E.A.; Kircher, K.J. Field demonstration of predictive heating control for an all-electric house in a cold climate. Appl. Energy 2024, 360, 122820. [Google Scholar] [CrossRef]
  29. Manic, M.; Wijayasekara, D.; Amarasinghe, K.; Rodriguez-Andina, J.J. Building energy management systems: The age of intelligent and adaptive buildings. IEEE Ind. Electron. Mag. 2016, 10, 25–39. [Google Scholar] [CrossRef]
  30. Zia, M.F.; Elbouchikhi, E.; Benbouzid, M. Microgrids energy management systems: A critical review on methods, solutions, and prospects. Appl. Energy 2018, 222, 1033–1055. [Google Scholar] [CrossRef]
  31. Michailidis, P.; Michailidis, I.; Vamvakas, D.; Kosmatopoulos, E. Model-Free HVAC Control in Buildings: A Review. Energies 2023, 16, 7124. [Google Scholar] [CrossRef]
  32. Mariano-Hernández, D.; Hernández-Callejo, L.; Zorita-Lamadrid, A.; Duque-Pérez, O.; García, F.S. A review of strategies for building energy management system: Model predictive control, demand side management, optimization, and fault detect & diagnosis. J. Build. Eng. 2021, 33, 101692. [Google Scholar]
  33. Pergantis, E.N.; Dhillon, P.; Premer, L.D.R.; Lee, A.H.; Ziviani, D.; Kircher, K.J. Humidity-aware model predictive control for residential air conditioning: A field study. Build. Environ. 2024, 266, 112093. [Google Scholar] [CrossRef]
  34. Pergantis, E.N.; Premer, L.D.R.; Priyadarshan; Lee, A.H.; Dhillon, P.; Groll, E.A.; Ziviani, D.; Kircher, K.J. Latent and Sensible Model Predictive Controller Demonstration in a House During Cooling Operation. ASHRAE Trans. 2024, 130, 177–185. [Google Scholar]
  35. Michailidis, I.T.; Schild, T.; Sangi, R.; Michailidis, P.; Korkas, C.; Fütterer, J.; Müller, D.; Kosmatopoulos, E.B. Energy-efficient HVAC management using cooperative, self-trained, control agents: A real-life German building case study. Appl. Energy 2018, 211, 113–125. [Google Scholar]
  36. Michailidis, P.; Pelitaris, P.; Korkas, C.; Michailidis, I.; Baldi, S.; Kosmatopoulos, E. Enabling optimal energy management with minimal IoT requirements: A legacy A/C case study. Energies 2021, 14, 7910. [Google Scholar] [CrossRef]
  37. Michailidis, I.T.; Sangi, R.; Michailidis, P.; Schild, T.; Fuetterer, J.; Mueller, D.; Kosmatopoulos, E.B. Balancing energy efficiency with indoor comfort using smart control agents: A simulative case study. Energies 2020, 13, 6228. [Google Scholar] [CrossRef]
  38. Michailidis, P.; Michailidis, I.; Gkelios, S.; Kosmatopoulos, E. Artificial Neural Network Applications for Energy Management in Buildings: Current Trends and Future Directions. Energies 2024, 17, 570. [Google Scholar] [CrossRef]
  39. Li, S.E. Reinforcement Learning for Sequential Decision and Optimal Control; Springer: Berlin/Heidelberg, Germany, 2023. [Google Scholar]
  40. Vamvakas, D.; Michailidis, P.; Korkas, C.; Kosmatopoulos, E. Review and evaluation of reinforcement learning frameworks on smart grid applications. Energies 2023, 16, 5326. [Google Scholar] [CrossRef]
  41. Recht, B. A tour of reinforcement learning: The view from continuous control. Annu. Rev. Control Robot. Auton. Syst. 2019, 2, 253–279. [Google Scholar]
  42. Mohammadi, P.; Darshi, R.; Shamaghdari, S.; Siano, P. Comparative Analysis of Control Strategies for Microgrid Energy Management with a Focus on Reinforcement Learning. IEEE Access 2024, 12, 171368–171395. [Google Scholar] [CrossRef]
  43. Ruelens, F.; Claessens, B.J.; Vandael, S.; De Schutter, B.; Babuška, R.; Belmans, R. Residential demand response of thermostatically controlled loads using batch reinforcement learning. IEEE Trans. Smart Grid 2016, 8, 2149–2159. [Google Scholar] [CrossRef]
  44. Kadamala, K.; Chambers, D.; Barrett, E. Enhancing HVAC control systems through transfer learning with deep reinforcement learning agents. Smart Energy 2024, 13, 100131. [Google Scholar] [CrossRef]
  45. Al Sayed, K.; Boodi, A.; Broujeny, R.S.; Beddiar, K. Reinforcement learning for HVAC control in intelligent buildings: A technical and conceptual review. J. Build. Eng. 2024, 95, 110085. [Google Scholar] [CrossRef]
  46. Wang, Z.; Hong, T. Reinforcement learning for building controls: The opportunities and challenges. Appl. Energy 2020, 269, 115036. [Google Scholar] [CrossRef]
  47. Gautam, M. Deep Reinforcement learning for resilient power and energy systems: Progress, prospects, and future avenues. Electricity 2023, 4, 336–380. [Google Scholar] [CrossRef]
  48. Jiang, Z.; Risbeck, M.J.; Ramamurti, V.; Murugesan, S.; Amores, J.; Zhang, C.; Lee, Y.M.; Drees, K.H. Building HVAC control with reinforcement learning for reduction of energy cost and demand charge. Energy Build. 2021, 239, 110833. [Google Scholar] [CrossRef]
  49. Yu, L.; Qin, S.; Zhang, M.; Shen, C.; Jiang, T.; Guan, X. A review of deep reinforcement learning for smart building energy management. IEEE Internet Things J. 2021, 8, 12046–12063. [Google Scholar] [CrossRef]
  50. Liu, S.; Henze, G.P. Experimental analysis of simulated reinforcement learning control for active and passive building thermal storage inventory: Part 2: Results and analysis. Energy Build. 2006, 38, 148–161. [Google Scholar] [CrossRef]
  51. Lazaridis, C.R.; Michailidis, I.; Karatzinis, G.; Michailidis, P.; Kosmatopoulos, E. Evaluating Reinforcement Learning Algorithms in Residential Energy Saving and Comfort Management. Energies 2024, 17, 581. [Google Scholar] [CrossRef]
  52. Singh, Y.; Pal, N. Reinforcement learning with fuzzified reward approach for MPPT control of PV systems. Sustain. Energy Technol. Assess. 2021, 48, 101665. [Google Scholar] [CrossRef]
  53. Wang, Z.; Xue, W.; Li, K.; Tang, Z.; Liu, Y.; Zhang, F.; Cao, S.; Peng, X.; Wu, E.Q.; Zhou, H. Dynamic combustion optimization of a pulverized coal boiler considering the wall temperature constraints: A deep reinforcement learning-based framework. Appl. Therm. Eng. 2025, 259, 124923. [Google Scholar] [CrossRef]
  54. Perera, A.; Kamalaruban, P. Applications of reinforcement learning in energy systems. Renew. Sustain. Energy Rev. 2021, 137, 110618. [Google Scholar]
  55. Gaviria, J.F.; Narváez, G.; Guillen, C.; Giraldo, L.F.; Bressan, M. Machine learning in photovoltaic systems: A review. Renew. Energy 2022, 196, 298–318. [Google Scholar] [CrossRef]
  56. Fu, Q.; Han, Z.; Chen, J.; Lu, Y.; Wu, H.; Wang, Y. Applications of reinforcement learning for building energy efficiency control: A review. J. Build. Eng. 2022, 50, 104165. [Google Scholar]
  57. Rezaie, B.; Esmailzadeh, E.; Dincer, I. Renewable energy options for buildings: Case studies. Energy Build. 2011, 43, 56–65. [Google Scholar]
  58. Lin, Y.; Yang, W.; Hao, X.; Yu, C. Building integrated renewable energy. Energy Explor. Exploit. 2021, 39, 603–607. [Google Scholar]
  59. Hayter, S.J. Integrating Renewable Energy Systems in Buildings (Presentation); Technical report; National Renewable Energy Lab. (NREL): Golden, CO, USA, 2011. [Google Scholar]
  60. Le, T.V.; ChuDuc, H.; Tran, Q.X. Optimized Integration of Renewable Energy in Smart Buildings: A Systematic Review from Scopus Data. In Proceedings of the 2024 9th International Conference on Applying New Technology in Green Buildings (ATiGB), Danang, Vietnam, 30–31 August 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 397–402. [Google Scholar]
  61. Chen, L.; Hu, Y.; Wang, R.; Li, X.; Chen, Z.; Hua, J.; Osman, A.I.; Farghali, M.; Huang, L.; Li, J.; et al. Green building practices to integrate renewable energy in the construction sector: A review. Environ. Chem. Lett. 2024, 22, 751–784. [Google Scholar]
  62. Hayter, S.J.; Kandt, A. Renewable Energy Applications for Existing Buildings; Technical report; National Renewable Energy Lab. (NREL): Golden, CO, USA, 2011. [Google Scholar]
  63. Vassiliades, C.; Agathokleous, R.; Barone, G.; Forzano, C.; Giuzio, G.; Palombo, A.; Buonomano, A.; Kalogirou, S. Building integration of active solar energy systems: A review of geometrical and architectural characteristics. Renew. Sustain. Energy Rev. 2022, 164, 112482. [Google Scholar]
  64. Bougiatioti, F.; Michael, A. The architectural integration of active solar systems. Building applications in the Eastern Mediterranean region. Renew. Sustain. Energy Rev. 2015, 47, 966–982. [Google Scholar] [CrossRef]
  65. Canale, L.; Di Fazio, A.R.; Russo, M.; Frattolillo, A.; Dell’Isola, M. An overview on functional integration of hybrid renewable energy systems in multi-energy buildings. Energies 2021, 14, 1078. [Google Scholar] [CrossRef]
  66. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  67. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  68. Ahrarinouri, M.; Rastegar, M.; Seifi, A.R. Multiagent reinforcement learning for energy management in residential buildings. IEEE Trans. Ind. Inform. 2020, 17, 659–666. [Google Scholar]
  69. Hernandez-Leal, P.; Kartal, B.; Taylor, M.E. A survey and critique of multiagent deep reinforcement learning. Auton. Agents Multi-Agent Syst. 2019, 33, 750–797. [Google Scholar]
  70. Busoniu, L.; Babuska, R.; De Schutter, B. A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2008, 38, 156–172. [Google Scholar]
  71. Castellini, J.; Oliehoek, F.A.; Savani, R.; Whiteson, S. The representational capacity of action-value networks for multi-agent reinforcement learning. arXiv 2019, arXiv:1902.07497. [Google Scholar]
  72. Sheng, J.; Wang, X.; Jin, B.; Yan, J.; Li, W.; Chang, T.H.; Wang, J.; Zha, H. Learning structured communication for multi-agent reinforcement learning. Auton. Agents Multi-Agent Syst. 2022, 36, 50. [Google Scholar]
  73. Canese, L.; Cardarilli, G.C.; Di Nunzio, L.; Fazzolari, R.; Giardino, D.; Re, M.; Spanò, S. Multi-agent reinforcement learning: A review of challenges and applications. Appl. Sci. 2021, 11, 4948. [Google Scholar] [CrossRef]
  74. Xie, H.; Song, G.; Shi, Z.; Zhang, J.; Lin, Z.; Yu, Q.; Fu, H.; Song, X.; Zhang, H. Reinforcement learning for vehicle-to-grid: A review. Adv. Appl. Energy 2025, 17, 100214. [Google Scholar]
  75. Mohammadi, P.; Nasiri, A.; Darshi, R.; Shirzad, A.; Abdollahipour, R. Achieving Cost Efficiency in Cloud Data Centers Through Model-Free Q-Learning. In Proceedings of the International Conference on Electrical and Electronics Engineering, Marmaris, Turkey, 22–24 April 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 457–468. [Google Scholar]
  76. Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar]
  77. Zhang, H.; Seal, S.; Wu, D.; Bouffard, F.; Boulet, B. Building energy management with reinforcement learning and model predictive control: A survey. IEEE Access 2022, 10, 27853–27862. [Google Scholar] [CrossRef]
  78. Wei, T.; Wang, Y.; Zhu, Q. Deep reinforcement learning for building HVAC control. In Proceedings of the 54th Annual Design Automation Conference 2017, Austin, TX, USA, 18–22 June 2017; pp. 1–6. [Google Scholar]
  79. Wang, Y.; He, H.; Tan, X. Truly proximal policy optimization. In Proceedings of the Uncertainty in Artificial Intelligence, PMLR, Virtual, 3–6 August 2020; pp. 113–122. [Google Scholar]
  80. Bolt, P.; Ziebart, V.; Jaeger, C.; Schmid, N.; Stadelmann, T.; Füchslin, R.M. A simulation study on energy optimization in building control with reinforcement learning. In Proceedings of the IAPR Workshop on Artificial Neural Networks in Pattern Recognition, Montréal, QC, Canada, 10–12 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 320–331. [Google Scholar]
  81. Sumiea, E.H.; Abdulkadir, S.J.; Alhussian, H.S.; Al-Selwi, S.M.; Alqushaibi, A.; Ragab, M.G.; Fati, S.M. Deep deterministic policy gradient algorithm: A systematic review. Heliyon 2024, 10, e30697. [Google Scholar]
  82. Ding, F.; Ma, G.; Chen, Z.; Gao, J.; Li, P. Averaged Soft Actor-Critic for Deep Reinforcement Learning. Complexity 2021, 2021, 6658724. [Google Scholar] [CrossRef]
  83. Dong, J.; Wang, H.; Yang, J.; Lu, X.; Gao, L.; Zhou, X. Optimal scheduling framework of electricity-gas-heat integrated energy system based on asynchronous advantage actor-critic algorithm. IEEE Access 2021, 9, 139685–139696. [Google Scholar]
  84. Yang, L.; Nagy, Z.; Goffin, P.; Schlueter, A. Reinforcement learning for optimal control of low exergy buildings. Appl. Energy 2015, 156, 577–586. [Google Scholar] [CrossRef]
  85. Raju, L.; Sankar, S.; Milton, R. Distributed optimization of solar micro-grid using multi agent reinforcement learning. Procedia Comput. Sci. 2015, 46, 231–239. [Google Scholar]
  86. De Somer, O.; Soares, A.; Vanthournout, K.; Spiessens, F.; Kuijpers, T.; Vossen, K. Using reinforcement learning for demand response of domestic hot water buffers: A real-life demonstration. In Proceedings of the 2017 IEEE PES Innovative Smart Grid Technologies Conference Europe (ISGT-Europe), Torino, Italy, 26–29 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–7. [Google Scholar]
  87. Ebell, N.; Heinrich, F.; Schlund, J.; Pruckner, M. Reinforcement learning control algorithm for a pv-battery-system providing frequency containment reserve power. In Proceedings of the 2018 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (SmartGridComm), Aalborg, Denmark, 29–31 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–6. [Google Scholar]
  88. Remani, T.; Jasmin, E.; Ahamed, T.I. Residential load scheduling with renewable generation in the smart grid: A reinforcement learning approach. IEEE Syst. J. 2018, 13, 3283–3294. [Google Scholar]
  89. Kim, S.; Lim, H. Reinforcement learning based energy management algorithm for smart energy buildings. Energies 2018, 11, 2010. [Google Scholar] [CrossRef]
  90. Prasad, A.; Dusparic, I. Multi-agent deep reinforcement learning for zero energy communities. In Proceedings of the 2019 IEEE PES Innovative Smart Grid Technologies Europe (ISGT-Europe), Bucharest, Romania, 29 September–2 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–5. [Google Scholar]
  91. Xu, X.; Jia, Y.; Xu, Y.; Xu, Z.; Chai, S.; Lai, C.S. A multi-agent reinforcement learning-based data-driven method for home energy management. IEEE Trans. Smart Grid 2020, 11, 3201–3211. [Google Scholar]
  92. Correa-Jullian, C.; Droguett, E.L.; Cardemil, J.M. Operation scheduling in a solar thermal system: A reinforcement learning-based framework. Appl. Energy 2020, 268, 114943. [Google Scholar] [CrossRef]
  93. Chen, S.J.; Chiu, W.Y.; Liu, W.J. User preference-based demand response for smart home energy management using multiobjective reinforcement learning. IEEE Access 2021, 9, 161627–161637. [Google Scholar] [CrossRef]
  94. Lissa, P.; Deane, C.; Schukat, M.; Seri, F.; Keane, M.; Barrett, E. Deep reinforcement learning for home energy management system control. Energy AI 2021, 3, 100043. [Google Scholar] [CrossRef]
  95. Raman, N.S.; Gaikwad, N.; Barooah, P.; Meyn, S.P. Reinforcement learning-based home energy management system for resiliency. In Proceedings of the 2021 American Control Conference (ACC), Online, 25–28 May 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1358–1364. [Google Scholar]
  96. Heidari, A.; Maréchal, F.; Khovalyg, D. Reinforcement Learning for proactive operation of residential energy systems by learning stochastic occupant behavior and fluctuating solar energy: Balancing comfort, hygiene and energy use. Appl. Energy 2022, 318, 119206. [Google Scholar] [CrossRef]
  97. Lu, J.; Mannion, P.; Mason, K. A multi-objective multi-agent deep reinforcement learning approach to residential appliance scheduling. IET Smart Grid 2022, 5, 260–280. [Google Scholar] [CrossRef]
  98. Shen, R.; Zhong, S.; Wen, X.; An, Q.; Zheng, R.; Li, Y.; Zhao, J. Multi-agent deep reinforcement learning optimization framework for building energy system with renewable energy. Appl. Energy 2022, 312, 118724. [Google Scholar]
  99. Wang, L.; Zhang, G.; Yin, X.; Zhang, H.; Ghalandari, M. Optimal control of renewable energy in buildings using the machine learning method. Sustain. Energy Technol. Assess. 2022, 53, 102534. [Google Scholar] [CrossRef]
  100. Cordeiro-Costas, M.; Villanueva, D.; Eguía-Oller, P.; Granada-Álvarez, E. Intelligent energy storage management trade-off system applied to Deep Learning predictions. J. Energy Storage 2023, 61, 106784. [Google Scholar] [CrossRef]
  101. Mocanu, E.; Mocanu, D.C.; Nguyen, P.H.; Liotta, A.; Webber, M.E.; Gibescu, M.; Slootweg, J.G. On-line building energy optimization using deep reinforcement learning. IEEE Trans. Smart Grid 2018, 10, 3698–3708. [Google Scholar]
  102. Vazquez-Canteli, J.R.; Henze, G.; Nagy, Z. MARLISA: Multi-agent reinforcement learning with iterative sequential action selection for load shaping of grid-interactive connected buildings. In Proceedings of the 7th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation, Virtual, 18–20 November 2020; pp. 170–179. [Google Scholar]
  103. Chen, B.; Donti, P.L.; Baker, K.; Kolter, J.Z.; Bergés, M. Enforcing policy feasibility constraints through differentiable projection for energy optimization. In Proceedings of the Twelfth ACM International Conference on Future Energy Systems, Torino, Italy, 28 June–2 July 2021; pp. 199–210. [Google Scholar]
  104. Jung, S.; Jeoung, J.; Kang, H.; Hong, T. Optimal planning of a rooftop PV system using GIS-based reinforcement learning. Appl. Energy 2021, 298, 117239. [Google Scholar]
  105. Yu, L.; Xie, W.; Xie, D.; Zou, Y.; Zhang, D.; Sun, Z.; Zhang, L.; Zhang, Y.; Jiang, T. Deep reinforcement learning for smart home energy management. IEEE Internet Things J. 2019, 7, 2751–2762. [Google Scholar]
  106. Lee, S.; Choi, D.H. Federated reinforcement learning for energy management of multiple smart homes with distributed energy resources. IEEE Trans. Ind. Inform. 2020, 18, 488–497. [Google Scholar]
  107. Ye, Y.; Qiu, D.; Wang, H.; Tang, Y.; Strbac, G. Real-time autonomous residential demand response management based on twin delayed deep deterministic policy gradient learning. Energies 2021, 14, 531. [Google Scholar] [CrossRef]
  108. Touzani, S.; Prakash, A.K.; Wang, Z.; Agarwal, S.; Pritoni, M.; Kiran, M.; Brown, R.; Granderson, J. Controlling distributed energy resources via deep reinforcement learning for load flexibility and energy efficiency. Appl. Energy 2021, 304, 117733. [Google Scholar]
  109. Lee, S.; Xie, L.; Choi, D.H. Privacy-preserving energy management of a shared energy storage system for smart buildings: A federated deep reinforcement learning approach. Sensors 2021, 21, 4898. [Google Scholar] [CrossRef]
  110. Pinto, G.; Piscitelli, M.S.; Vázquez-Canteli, J.R.; Nagy, Z.; Capozzoli, A. Coordinated energy management for a cluster of buildings through deep reinforcement learning. Energy 2021, 229, 120725. [Google Scholar] [CrossRef]
  111. Pinto, G.; Deltetto, D.; Capozzoli, A. Data-driven district energy management with surrogate models and deep reinforcement learning. Appl. Energy 2021, 304, 117642. [Google Scholar] [CrossRef]
  112. Gao, Y.; Matsunami, Y.; Miyata, S.; Akashi, Y. Operational optimization for off-grid renewable building energy system using deep reinforcement learning. Appl. Energy 2022, 325, 119783. [Google Scholar] [CrossRef]
  113. Langer, L.; Volling, T. A reinforcement learning approach to home energy management for modulating heat pumps and photovoltaic systems. Appl. Energy 2022, 327, 120020. [Google Scholar] [CrossRef]
  114. Pinto, G.; Kathirgamanathan, A.; Mangina, E.; Finn, D.P.; Capozzoli, A. Enhancing energy management in grid-interactive buildings: A comparison among cooperative and coordinated architectures. Appl. Energy 2022, 310, 118497. [Google Scholar] [CrossRef]
  115. Nweye, K.; Sankaranarayanan, S.; Nagy, Z. MERLIN: Multi-agent offline and transfer learning for occupant-centric operation of grid-interactive communities. Appl. Energy 2023, 346, 121323. [Google Scholar] [CrossRef]
  116. Qiu, D.; Xue, J.; Zhang, T.; Wang, J.; Sun, M. Federated reinforcement learning for smart building joint peer-to-peer energy and carbon allowance trading. Appl. Energy 2023, 333, 120526. [Google Scholar] [CrossRef]
  117. Xie, J.; Ajagekar, A.; You, F. Multi-agent attention-based deep reinforcement learning for demand response in grid-responsive buildings. Appl. Energy 2023, 342, 121162. [Google Scholar] [CrossRef]
  118. Qiu, Y.; Zhou, S.; Xia, D.; Gu, W.; Sun, K.; Han, G.; Zhang, K.; Lv, H. Local integrated energy system operational optimization considering multi-type uncertainties: A reinforcement learning approach based on improved TD3 algorithm. IET Renew. Power Gener. 2023, 17, 2236–2256. [Google Scholar] [CrossRef]
  119. Deng, X.; Zhang, Y.; Jiang, Y.; Qi, H. A novel operation method for renewable building by combining distributed DC energy system and deep reinforcement learning. Appl. Energy 2024, 353, 122188. [Google Scholar] [CrossRef]
  120. Kazmi, H.; D’Oca, S.; Delmastro, C.; Lodeweyckx, S.; Corgnati, S.P. Generalizable occupant-driven optimization model for domestic hot water production in NZEB. Appl. Energy 2016, 175, 1–15. [Google Scholar] [CrossRef]
  121. Kofinas, P.; Dounis, A.I.; Vouros, G.A. Fuzzy Q-Learning for multi-agent decentralized energy management in microgrids. Appl. Energy 2018, 219, 53–67. [Google Scholar]
  122. Chasparis, G.C.; Pichler, M.; Spreitzhofer, J.; Esterl, T. A cooperative demand-response framework for day-ahead optimization in battery pools. Energy Inform. 2019, 2, 29. [Google Scholar]
  123. Gao, Y.; Matsunami, Y.; Miyata, S.; Akashi, Y. Multi-agent reinforcement learning dealing with hybrid action spaces: A case study for off-grid oriented renewable building energy system. Appl. Energy 2022, 326, 120021. [Google Scholar]
  124. Ashenov, N.; Myrzaliyeva, M.; Mussakhanova, M.; Nunna, H.K. Dynamic cloud and ANN based home energy management system for end-users with smart-plugs and PV generation. In Proceedings of the 2021 IEEE Texas Power and Energy Conference (TPEC), College Station, TX, USA, 2–5 February 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6. [Google Scholar]
  125. Deltetto, D.; Coraci, D.; Pinto, G.; Piscitelli, M.S.; Capozzoli, A. Exploring the potentialities of deep reinforcement learning for incentive-based demand response in a cluster of small commercial buildings. Energies 2021, 14, 2933. [Google Scholar] [CrossRef]
  126. Huang, C.; Zhang, H.; Wang, L.; Luo, X.; Song, Y. Mixed deep reinforcement learning considering discrete-continuous hybrid action space for smart home energy management. J. Mod. Power Syst. Clean Energy 2022, 10, 743–754. [Google Scholar]
  127. Nicola, M.; Nicola, C.I.; Selișteanu, D. Improvement of the control of a grid connected photovoltaic system based on synergetic and sliding mode controllers using a reinforcement learning deep deterministic policy gradient agent. Energies 2022, 15, 2392. [Google Scholar] [CrossRef]
  128. Almughram, O.; Abdullah ben Slama, S.; Zafar, B.A. A reinforcement learning approach for integrating an intelligent home energy management system with a vehicle-to-home unit. Appl. Sci. 2023, 13, 5539. [Google Scholar] [CrossRef]
  129. Zhou, X.; Du, H.; Sun, Y.; Ren, H.; Cui, P.; Ma, Z. A new framework integrating reinforcement learning, a rule-based expert system, and decision tree analysis to improve building energy flexibility. J. Build. Eng. 2023, 71, 106536. [Google Scholar] [CrossRef]
  130. Binyamin, S.S.; Slama, S.A.B.; Zafar, B. Artificial intelligence-powered energy community management for developing renewable energy systems in smart homes. Energy Strategy Rev. 2024, 51, 101288. [Google Scholar]
  131. Wang, Z.; Xiao, F.; Ran, Y.; Li, Y.; Xu, Y. Scalable energy management approach of residential hybrid energy system using multi-agent deep reinforcement learning. Appl. Energy 2024, 367, 123414. [Google Scholar] [CrossRef]
  132. Anvari-Moghaddam, A.; Rahimi-Kian, A.; Mirian, M.S.; Guerrero, J.M. A multi-agent based energy management solution for integrated buildings and microgrid system. Appl. Energy 2017, 203, 41–56. [Google Scholar] [CrossRef]
  133. Tomin, N.; Shakirov, V.; Kurbatsky, V.; Muzychuk, R.; Popova, E.; Sidorov, D.; Kozlov, A.; Yang, D. A multi-criteria approach to designing and managing a renewable energy community. Renew. Energy 2022, 199, 1153–1175. [Google Scholar] [CrossRef]
  134. Tomin, N.; Shakirov, V.; Kozlov, A.; Sidorov, D.; Kurbatsky, V.; Rehtanz, C.; Lora, E.E. Design and optimal energy management of community microgrids with flexible renewable energy sources. Renew. Energy 2022, 183, 903–921. [Google Scholar] [CrossRef]
  135. Michailidis, I.T.; Manolis, D.; Michailidis, P.; Diakaki, C.; Kosmatopoulos, E.B. Autonomous self-regulating intersections in large-scale urban traffic networks: A Chania City case study. In Proceedings of the 2018 5th International Conference on Control, Decision and Information Technologies (CoDIT), Thessaloniki, Greece, 10–13 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 853–858. [Google Scholar]
  136. Michailidis, I.T.; Manolis, D.; Michailidis, P.; Diakaki, C.; Kosmatopoulos, E.B. A decentralized optimization approach employing cooperative cycle-regulation in an intersection-centric manner: A complex urban simulative case study. Transp. Res. Interdiscip. Perspect. 2020, 8, 100232. [Google Scholar] [CrossRef]
  137. Michailidis, I.T.; Kapoutsis, A.C.; Korkas, C.D.; Michailidis, P.T.; Alexandridou, K.A.; Ravanis, C.; Kosmatopoulos, E.B. Embedding autonomy in large-scale IoT ecosystems using CAO and L4G-CAO. Discov. Internet Things 2021, 1, 8. [Google Scholar] [CrossRef]
Figure 1. Citations share between the different model-free methodologies for HVAC frameworks: Overall share (%) (Left) and occurrence per year (Right) for 2014–2022 period.
Figure 2. Paper structure.
Figure 3. General concept of RL control in BEMS.
Figure 4. RL types occurrence (Left) and RL types percentage (%) (Right) in RES-integrated BEMS applications (2015–2025).
Figure 5. RL methodologies occurrence (Left) and RL methodologies percentage (%) (Right) in RES-integrated BEMS applications (2015–2025).
Figure 6. Model types occurrence (Left) and model types percentage (%) (Right) in RES-integrated BEMS applications (2015–2025).
Figure 7. Agent-based applications occurrence (Left) and agent-based applications percentage (%) (Right) in RES-integrated BEMS applications (2015–2025).
Figure 8. Multi-agent RL methodologies occurrence in RES-integrated BEMS applications (2015–2025).
Figure 9. MARL structure (Left), training (Center), and coordination (Right) occurrences in RES-integrated BEMS applications (2015–2025).
Figure 10. Reward function type occurrence in RES-integrated BEMS applications (2015–2025).
Figure 11. Building types occurrence (Left) and percentage (%) (Right) in RES-integrated BEMS applications (2015–2025).
Figure 12. Baseline control methodologies occurrence (Left) and percentage (%) (Right) in RES-integrated BEMS applications (2015–2025).
Figure 13. RES types occurrence (Left) and percentage of single-RES and multi-RES applications (%) (Right) in RES-integrated BEMS applications (2015–2025).
Figure 14. BEMS combination types occurrence in RES-integrated BEMS applications (2015–2025).
Figure 15. Building types occurrence (Left) and percentage (%) (Right) in RES-integrated BEMS applications (2015–2025).
Table 1. Summarized table of value-based RL applications for RES-integrated BEMS (2015–2025).
Ref. | Year | Method | Agent | BEMS | Residential | Commercial | Simulation | Real-Life | Citations
[84] | 2015 | QL | Multi | PV/GHP/HVAC | x | | x | | 150
[85] | 2015 | QL | Multi | PV/ESS | x | | x | | 37
[86] | 2017 | FQI | Single | PV/HVAC/DHW | x | | | x | 19
[87] | 2018 | SARSA | Single | PV/ESS | x | | x | | 16
[88] | 2019 | QL | Single | PV/ESS/EVs | x | | x | | 84
[89] | 2018 | QL | Single | PV | | x | x | | 128
[90] | 2019 | DQN | Multi | PV/ESS | | | x | | 39
[91] | 2020 | QL | Multi | PV/HVAC/LS/EVs | x | | x | | 283
[92] | 2020 | QL | Single | SWH/HVAC | x | | x | | 43
[93] | 2021 | QL | Single | PV/ESS/Other | x | | x | | 24
[94] | 2021 | QL | Single | PV/HVAC/DHW | x | | x | | 132
[95] | 2021 | QL | Single | PV/ESS/Other | x | | x | | 11
[96] | 2022 | DQN | Single | PV/SWH/HVAC | x | | x | | 33
[97] | 2022 | DQN | Multi | PV/Other | x | | x | | 18
[98] | 2022 | DQN | Single | PV/WT/GHP | | x | x | | 64
[99] | 2022 | QL | Single | PV/WT/SWH/BIO | | x | x | | 15
[100] | 2023 | DQN | Single | PV/ESS | x | | x | | 14
Table 2. Summarized table of policy-based RL applications for RES-integrated BEMS (2015–2025).
Ref. | Year | Method | Agent | BEMS | Residential | Commercial | Simulation | Real-Life | Citations
[101] | 2019 | DPG | Single | PV/EVs/Other | x | | x | | 464
[102] | 2020 | MARLISA | Multi | PV/ESS/HVAC | x | | x | | 51
[103] | 2021 | PPO | Single | PV/HVAC | | x | x | | 26
[104] | 2021 | PPO | Single | PV | x | | x | | 40
Table 3. Summarized table of actor-critic RL applications for RES-integrated BEMS (2015–2025).
Ref. | Year | Method | Agent | BEMS | Residential | Commercial | Simulation | Real-Life | Citations
[105] | 2019 | DDPG | Single | PV/ESS/HVAC | x | | x | | 304
[106] | 2020 | A2C | Multi | PV/ESS/HVAC/Other | x | | x | | 112
[107] | 2021 | TD3 | Single | PV/ESS/HVAC/EVs | x | | x | | 39
[108] | 2021 | DDPG | Single | PV/ESS/HVAC | | x | | x | 56
[109] | 2021 | - | Multi | PV/ESS/HVAC | | x | x | | 16
[110] | 2021 | SAC | Single | PV/ESS/TSS/HVAC/DHW | | x | x | | 69
[111] | 2021 | SAC | Single | PV/HVAC/TSS | | x | x | | 72
[112] | 2022 | TD3 | Single | PV/ESS/BIO | | x | x | | 46
[113] | 2022 | DDPG | Single | PV/SWH/ESS/HVAC/TSS | x | | x | | 25
[114] | 2022 | SAC | Multi | PV/SWH/ESS | x | x | x | | 26
[115] | 2023 | SAC | Multi | PV/ESS | x | | x | | 21
[116] | 2023 | DDPG | Multi | PV/ESS/HVAC | x | x | x | | 46
[117] | 2023 | SAC | Multi | PV/ESS/DHW/HVAC | | x | x | x | 35
[118] | 2023 | TD3 | Single | PV/WT/ESS | | x | x | | 10
[119] | 2024 | SAC | Single | PV/ESS/EVs | x | x | x | | 12
Table 4. Summarized table of hybrid RL applications for RES-integrated BEMS (2015–2025).
Ref. | Year | Method | Agent | BEMS | Residential | Commercial | Simulation | Real-Life | Citations
[120] | 2016 | HRL/ACO | Single | PV/DHW/HVAC | x | | x | x | 69
[121] | 2018 | QL/FLC | Multi | PV/ESS/Other | x | | x | | 179
[122] | 2019 | Monte Carlo/ADP | Multi | PV/ESS | x | | x | | 10
[123] | 2020 | DQN/TD3 | Multi | PV/ESS/BIO | | x | | x | 21
[124] | 2021 | QL/ANN | Single | PV/ESS/Other | x | | x | | 14
[125] | 2021 | SAC/RBC | Single | PV/TSS | | x | x | | 24
[126] | 2022 | DQN/DDPG | Single | PV/ESS/HVAC/Other | x | | x | | 45
[127] | 2022 | TD3/MPC | Single | PV/Other | x | | x | | 11
[128] | 2023 | QL/FLC | Single | PV/ESS/EVs/Other | x | | x | | 12
[129] | 2023 | DDPG/RBC | Single | PV/ESS | | x | x | | 26
[130] | 2024 | QL/ANN | Multi | PV/ESS/EVs | x | | x | | 12
[131] | 2024 | PPO/IL | Multi | PV/ESS/HVAC | x | | x | | 15
Table 5. Summarized table of other RL applications for RES-integrated BEMS (2015–2025).
Ref. | Year | Method | Agent | BEMS | Residential | Commercial | Simulation | Real-Life | Citations
[132] | 2017 | Bayesian RL | Multi | PV/ESS/WT/SWH/HVAC | x | | x | | 37
[133] | 2022 | MCTS | Multi | PV/ESS/WT/GHP | x | | x | | 23
[134] | 2022 | MCTS | Multi | PV/ESS/WT/GHP | x | | x | | 116
Table 6. Summaries of Value-based applications.
Author (Year) | Summary
Yang et al. [84] | MARL QL optimized a PV/T-GHP system with floor heating in a Swiss residence. RL learned online to maximize net thermal and electrical output, ensuring GHP compensation and optimal operation. Achieved 100% heat demand coverage (vs. 97% RBC) and 10% higher PV/T energy capture.
Raju et al. [85] | Multi-agent QL scheduled PV/ESS operations in a university microgrid, coordinating battery storage for optimal grid interactions. The approach reduced grid reliance by 15–20%, improving energy balancing and enhancing battery utilization over a 10-year horizon.
De Somer et al. [86] | Model-based RL (FQI with ERT regression) optimized DR in DHW buffers of PV-integrated homes. Real-life tests in 6 smart homes showed a 20% increase in PV self-consumption, reducing grid dependence by dynamically scheduling heating cycles.
Ebell et al. [87] | SARSA-ANN-based RL managed a PV/ESS providing frequency containment reserve. The agent minimized grid imports while ensuring real-time compliance with demand fluctuations. Simulations showed a 7.8% reduction in grid reliance compared to RBC.
Kim et al. [89] | QL single-agent control optimized PV/ESS/V2G scheduling in a university building, minimizing costs through ToU pricing strategies. Reduced daily energy costs by 20–25% and grid reliance by 15–30% while dynamically shifting charging/discharging cycles.
Remani et al. [88] | QL-based RL optimized real-time load scheduling for residential PV-powered homes. Stochastic PV modeling ensured adaptive scheduling, reducing energy costs from 735 units to 16.25 units and enhancing self-sufficiency.
Prasad et al. [90] | Multi-agent DQN optimized peer-to-peer energy sharing in a PV/ESS residential community. Agents learned when to store, lend, or borrow power. Simulations showed 40–60 kWh lower grid reliance per household, achieving near-zero energy status.
Xu et al. [91] | Multi-agent QL-based HEMS integrated PV, HVAC, LS, EVs, and shiftable loads. An Extreme Learning Machine predicted PV output and prices. Cost reductions reached 44.6%, and dynamic scheduling enhanced system flexibility.
Correa et al. [92] | Tabular QL optimized SWH scheduling in a university with solar thermal collectors and heat pumps (HP). TRNSYS simulations revealed a 21% efficiency boost in low solar conditions, optimizing demand-side response.
Chen et al. [93] | Multi-objective QL optimized DR for PV/ESS smart homes, balancing cost vs. user satisfaction via dual Q-tables. Achieved 8.44% cost savings, with a 1.37% increase in user satisfaction.
Lissa et al. [94] | QL-based HEMS with ANN-approximated Q-values optimized a PV/ESS/DHW system. Dynamic setpoint adjustments improved PV self-consumption (+9.5%), energy savings (+16%), and load shifting (+10.2%).
Raman et al. [95] | Zap QL optimized PV/ESS scheduling during grid outages, ensuring resilience. Computation time was reduced (0.14 ms RL vs. 62.47 s MPC), maintaining 100% reliability for critical loads in disaster scenarios.
Heidari et al. [96] | DDQN optimized a PV/HP/DHW framework, incorporating Legionella control and user comfort constraints. Swiss home trials reported 7–60% energy savings over RBC while maintaining hygiene and adaptability.
Lu et al. [97] | MARL DQN optimized residential appliance scheduling with PV. Individual agents learned cooperative control policies, reducing costs by 30%, peak demand by 18.2%, and improving punctuality by 37.3%.
Shen et al. [98] | D3QN-based MARL optimized PV/WT/GHP systems in office buildings. Prioritized experience replay and feasible action screening reduced discomfort by 84%, minimized unconsumed renewable energy by 43%, and cut energy costs by 8%.
Wang et al. [99] | QL-based BEEL method optimized DHW/PV/BIO interactions in a multi-zone building. Enhanced control over heating, cooling, and storage reduced heating demand by 26% and cooling energy by 15%.
Cordeiro et al. [100] | DQN/SNN control optimized a PV/ESS framework in a university setting. Boosted self-sufficiency (41.39% → 52.92%), reduced grid reliance (−17.85%), and achieved EUR 25,000 annual savings with an 8.56 kg CO2 cut.
Table 7. Summaries of Policy-based applications.
Author (Year) | Summary
Mocanu et al. [101] | Proposed energy management for PV-integrated residences using DPG and DQN. The model was trained on Pecan Street data, optimizing real-time energy consumption and costs, achieving a 26.3% peak demand reduction, 27.4% energy cost savings, and improved PV utilization.
Vazquez et al. [102] | MARLISA MARL optimized urban energy systems with PV, TSS, HPs, and heaters. MARLISA reduced daily peak demand by 15% and ramping by 35%, and increased the load factor by 10%. Combining MARLISA with RBC accelerated convergence within 2 years.
Chen et al. [103] | The PPO-based PROF method integrated ANN projection layers to enforce convex constraints in energy optimization. Applied to radiant heating on Carnegie Mellon's campus, enhancing thermal comfort and reducing energy use by 4%.
Jung et al. [104] | GIS-PPO RL model for optimal rooftop PV planning in South Korea. Simulated 10,000 economic scenarios, optimizing panel allocation under future uncertainties. Achieved USD 539,197 profit, outperforming RBC and GA by 4.4%, while reducing global warming potential by 91.8%.
Table 8. Summaries of Actor-critic applications.
Author (Year) | Summary
Yu et al. [105] | DDPG optimized HVAC scheduling and ESS management in a residence, integrating PV. The method handled thermal inertia and stochastic DR to minimize energy costs while ensuring comfort. Simulations using real-world data showed 8.10–15.21% lower costs compared to RBC.
Lee et al. [106] | MARL A2C optimized PV-ESS and appliance scheduling across multiple residences. Each household trained locally, periodically updating a global model to improve efficiency. The approach reduced electricity costs while maintaining comfort and accelerating learning convergence.
Ye et al. [107] | TD3-based DR optimized a BEMS integrating PV, ESS, HVAC, and EVs with V2G. The algorithm addressed uncertainties in solar power generation and non-shiftable loads, achieving 12.45% and 5.93% lower energy costs compared to DQN and DPG on real-world datasets.
Touzani et al. [108] | Developed a DDPG-based controller for DER management in a commercial building equipped with PV, ESS, and HVAC. The system was trained on synthetic data and deployed in FLEXLAB, demonstrating energy savings through optimized scheduling of RES and load shifting.
Lee et al. [109] | Proposed a hierarchical framework with federated learning to optimize shared ESS and HVAC scheduling in smart buildings. The approach preserved privacy by sharing only abstracted model parameters while reducing HVAC consumption by 24–32% and electricity costs by 18.6–20.6%.
Pinto et al. [110] | Proposed a centralized SAC for HVAC and TSS management in commercial buildings (offices, retail, and restaurant). The model balanced peak demand reduction, operational cost savings (4% reduction), and load profile smoothing (12% peak decrease, 6% improved load uniformity).
Pinto et al. [111] | Utilized SAC for HP and TSS coordination, integrating LSTM-based indoor temperature predictions. The approach optimized grid interaction and storage utilization, reducing peak demand by 23% and the Peak-to-Average Ratio (PAR) by 20%, with 20% faster training.
Gao et al. [112] | Investigated TD3 and DDPG for optimizing off-grid hybrid RES systems in a Japanese office building. The model ensured safe battery operation and grid stability, with TD3 outperforming DDPG by reducing hourly grid power error below 2 kWh and increasing battery safety by 7.72 h.
Langer et al. [113] | Examined DDPG for home energy management with PV, a modulating air-to-water HP, ESS, and TSS. The model transitioned from MILP to DRL, leveraging domain knowledge for stable learning. Simulations in Germany demonstrated 75% self-sufficiency and 39.6% cost savings over RBC.
Pinto et al. [114] | Developed a MARL-based SAC model for district energy management (three residential buildings and a restaurant). The decentralized approach optimized PV self-consumption and thermal storage, reducing energy costs by 7% and peak demand by 14% compared to RBC.
Qiu et al. [116] | Proposed a federated multi-agent DDPG approach for joint peer-to-peer energy and carbon trading in mixed-use buildings (residential, commercial, industrial). The Fed-JPC model reduced total energy and environmental costs by 5.87% and 8.02% while ensuring data privacy.
Nweye et al. [115] | Introduced the MERLIN framework utilizing SAC for distributed PV-battery optimization in 17 zero-net-energy homes. Combined real-world smart meter data and simulations, reducing training data needs by 50%, improving ramping by 60%, and lowering peak load by 35%.
Xie et al. [117] | Applied a MARL SAC framework with an attention mechanism for DR in grid-interactive buildings. The model controlled PV, ESS, and DHW in residential, office, and academic buildings, achieving a 6–8% reduction in net demand and USD 92,158 annual electricity savings at Cornell University.
Qiu et al. [118] | Proposed a TD3-based RL approach with dynamic noise balancing to optimize PV, wind, ESS, and hydrogen-based energy systems in commercial buildings. The model reduced operating costs by 18.46%, eliminated RES curtailment, and achieved superior accuracy over traditional methods.
Deng et al. [119] | Developed the DC-RL model integrating a distributed DC energy system with SAC. Optimized PV, ESS, EVs, and flexible loads for residential and office buildings. Increased PV self-consumption by 38%, satisfaction by 9%, and PV self-sufficiency to 93%, while reducing ESS reliance by 33%.
Table 9. Summaries of Hybrid applications.
Author (Year) | Summary
Kazmi et al. [120] | A hybrid approach combining RL, heuristics, and ACO optimized DHW production in 46 NZEBs in the Netherlands, integrating PV and ASHPs. Simulations showed 27% energy savings, while real-world tests yielded 61 kWh savings over 3.5 months.
Kofinas et al. [121] | Proposed a MARL system using fuzzy QL to control a microgrid with PV, a diesel generator, a fuel cell, ESS, and an electrolyzer. Each device's RL agent optimized the energy balance independently. The system achieved 1.54% uncovered energy, lower diesel usage (0.87%), and reduced ESS discharges.
Chasparis et al. [122] | Developed a hierarchical ADP-based RL model where an aggregator controlled residential PV/ESS systems for energy bidding. Used Monte Carlo Least Squares ADP for flexibility forecasting. Real-world simulations in Austria (30 homes) improved revenues over 10 test days.
Gao et al. [123] | Employed a MARL framework using TD3 (continuous actions) and DQN (discrete actions) for off-grid energy optimization in a Japanese office building with PV, BIO, and ESS. Achieved a 64.93% improvement in off-grid operation and an 84% reduction in unsafe battery runtime.
Ashenov et al. [124] | Integrated ANNs for forecasting and QL for load scheduling in a HEMS with PV, ESS, and smart plugs. Distinguished non-shiftable and shiftable loads, dynamically optimizing scheduling. Achieved a 24% cost reduction and a 15% profit increase in a single-home case study.
Deltetto et al. [125] | Combined SAC with RBC for demand response in small commercial buildings (office, retail, restaurant) with PV. SAC alone reduced costs by 9% and energy use by 7% and improved peak shaving by 4%, but violated DR constraints. The hybrid SAC-RBC balanced cost and DR compliance.
Huang et al. [126] | Proposed a mixed DRL model integrating DQN and DDPG for discrete-continuous action spaces in a HEMS (PV, ESS, HVAC, appliances). A safe-mixed DRL version mitigated comfort violations. Achieved a 25.8% cost reduction and significant thermal comfort improvements.
Nicola et al. [127] | Combined Synergetic (SYN) and Sliding Mode Control (SMC) with a TD3 RL agent for MPPT in a 100 kW PV-grid system. The RL agent stabilized the DC-link voltage, reducing steady-state error (<0.02%) and overshoot under 30% load variations.
Almughram et al. [128] | Developed RL-HCPV using fuzzy QL and deep learning (ANN, LSTM) for predictive PV generation, EV SOC, and energy trading in smart homes with V2H. Reduced grid reliance by 38% (sunny days) and 24% (cloudy days), cutting electricity bills to USD 3.92/day.
Zhou et al. [129] | Integrated DDPG with a rule-based expert system for energy flexibility in a net-zero energy office building with PV and ESS. Used CART to analyze external influences. Achieved a 7% cost reduction, a 9.2% PV self-consumption increase, and 10.6% lower grid reliance.
Binyamin et al. [130] | Combined multi-agent QL and ANN for smart home P2P energy trading with PV, ESS, and EVs. Optimized load scheduling, storage, and trading under real-time pricing. Achieved 9.3–16.07% household reward improvements and cost reductions under varying solar penetration.
Wang et al. [131] | Proposed MAPPO and Imitation Learning for scalable hybrid energy management (PV, ESS, HVAC) in ZEHs. Centralized training with decentralized execution improved PV self-consumption (18.47%) and energy self-sufficiency (46.10%) while ensuring thermal comfort.
Table 10. Summaries of Other RL applications.
Author (Year) | Summary
Anvari et al. [132] | Developed an ontology-driven multi-agent system for energy management in residential microgrids integrating buildings, RES, and controllable loads. The system used Bayesian RL for real-time battery bank optimization, managing PV, WT, DHW, radiant floor systems, and micro-CHP. Achieved 5–13% cost reductions, improved occupant comfort, and enhanced system resilience under dynamic pricing schemes.
Tomin et al. [133] | Proposed a multi-criteria approach for designing and managing renewable energy communities using bi-level programming and RL, namely, the Monte Carlo Tree Search (MCTS) approach. Optimized the operation of PV, WT, biomass gasifiers, and storage in remote Japanese villages. Results showed 75% cost reductions in electricity tariffs, higher RES penetration, and balanced economic-environmental benefits.
Tomin et al. [134] | Applied bi-level optimization with RL (MCTS) for designing and managing community microgrids in Siberian settlements. Integrated PV, WT, biomass gasifiers, and storage, optimizing energy market operations. Demonstrated a 20–40% LCOE reduction, improved electricity reliability, and operational flexibility.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
