1. Introduction
In the USA and Europe, roughly 35% of the energy consumption in 2008 was used in HVAC systems [1]. To reduce energy consumption, and hence the carbon footprint of heat and energy production for HVAC systems, regulations regarding the insulation of buildings have been tightened [2]. Another way to reduce energy consumption in buildings is to use more advanced control techniques, reducing energy waste and increasing comfort. For large buildings, Model Predictive Controllers (MPCs) have been shown to be effective [3,4], but MPCs require a full thermodynamic model of the building, which is not economically feasible for regular households. Smart controllers based on scheduling according to energy prices have been proposed [5,6]. These algorithms require less commissioning than an MPC but are still laborious to commission. To handle the commissioning issues and still harvest the benefits of advanced control techniques, model-free Reinforcement Learning (RL) is proposed for the control task.
This article focuses specifically on the control of Underfloor Heating (UFH) systems in domestic houses. Traditionally, hysteresis control with the room temperature as input is used for controlling these UFH systems. The hysteresis controller fully opens or closes the control valve supplying heat to the floor depending on the room temperature. The main issue with using hysteresis control in UFH is the slow thermodynamics of the system, which can lead to time constants between 10 min and 3 h depending on the floor type and material [7]. Due to the slow responses in the system, the hysteresis controller is not able to keep the temperature constant because of its inability to predict the energy demand of the rooms. The temperature of the supply water is traditionally controlled by an ambient-temperature-compensated controller. Like the hysteresis controller, this type of controller is affected by the slow convection of heat from the water to the room, but also by the delayed response associated with transporting the water in the pipes.
Reinforcement Learning (RL) is a model-free adaptive control method and, as such, it can adapt to the specific dynamic properties of its environment [8]. This capability makes RL particularly interesting for the control of UFH or Heating, Ventilation, and Air-conditioning (HVAC) systems in general [9]. In particular, the commissioning phase is automated with RL. Moreover, user behavior has a large effect on the comfort level of the temperature zones, and RL is able to take this behavior into account in the control as well [10]. In this paper, however, user behavior is not included, because the goal is to investigate how the RL algorithms adapt to the building environment; user behavior would only complicate this analysis.
The RL algorithm studied in this article is an online learning method. Therefore, the RL agent or agents will not perform optimally during training. The training time is highly correlated with the complexity of the state-action space [8]. This correlation is an issue in the RL control design for UFH, as it makes it difficult to scale the RL algorithm to houses with multiple temperature zones [11]. To illustrate the scaling problem of having a single agent control the entire system, the action-state space has been calculated for one-, two-, three-, and four-zone UFH systems, see Table 1.
Table 1 shows that the action-state space grows exponentially with the number of temperature zones in the system, which means that RL control does not scale well in the UFH control task or in many other control problems [8]. To deal with this scaling problem, we propose to incorporate Multi-Agent Reinforcement Learning (MARL). Instead of formulating the problem as a Markov Decision Process (MDP) and using Single Agent Reinforcement Learning (SARL), the problem is formulated as a Markov Game (MG), and MARL can be used; see Figure 1 for an illustration of an MDP and an MG.
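To make the exponential growth concrete, the minimal sketch below computes the joint action space for one to four zones, assuming (as in the action formulation of Section 4, Table 3) 15 discrete supply-temperature actions and one on/off valve per zone; the state dimensions of Table 1 are omitted.

```python
# Sketch of the action-space growth behind Table 1, under the assumption of
# 15 discrete supply-temperature levels and one on/off valve per zone
# (these counts follow the action formulation in Section 4 / Table 3).
for zones in range(1, 5):
    sarl_actions = 15 * 2 ** zones   # single agent: every combination is one action
    marl_actions = 15 + 2 * zones    # multi-agent: per-agent action spaces add up
    print(f"{zones} zone(s): SARL joint actions = {sarl_actions}, "
          f"MARL summed actions = {marl_actions}")
```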
From Figure 1, it is seen that the interaction with the environment changes, but not the environment itself. Whereas the environment in a single-agent system receives one action, in a multi-agent system it receives an action vector with the same size as the number of agents in the system. The states and rewards received by the agents from the environment are distributed, such that it is possible to pass only relevant state information to a given agent. Formulating an environment as an MG to use MARL has been used in other applications as well. In [12], a voltage controller for a power grid is designed, MARL is applied with success for route planning in road network environments in [13], and MARL is also used for training unmanned fighter aircraft in air-to-air combat in [14].
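The difference in interaction can be illustrated with two minimal interface sketches (class and method names are hypothetical, not from any specific library): the MDP environment accepts a single action, whereas the MG environment accepts one action per agent and returns distributed, per-agent states and rewards.

```python
from typing import Dict, List, Tuple

class MDPEnv:
    """Single-agent interaction (MDP): one action in, one state and reward out."""
    def step(self, action: int) -> Tuple[List[float], float]:
        ...

class MGEnv:
    """Multi-agent interaction (MG): one action per agent in, distributed
    per-agent local states and individual rewards out."""
    def step(self, actions: Dict[str, int]) -> Tuple[Dict[str, List[float]],
                                                     Dict[str, float]]:
        ...
```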
RL for HVAC systems has previously been studied. In [15], Q-learning is used for a single thermostat. Here, a 10% energy saving is achieved by scheduling the reference temperature such that comfort is only considered relevant if there are occupants in the room. RL is used in [16] for controlling the supply temperature and the pressure in a mixing loop. Though the mixing loop only represents a small part of the entire HVAC system, it shows that RL can outperform state-of-the-art industrial controllers. In [17], RL is used to control airflow rates for up to five zones, where each zone has an individual actuator. It is found that it is possible to reduce energy cost, but training times increase drastically when going from one to four zones. This result supports the need for an MG formulation, which makes it possible to use MARL in an HVAC system.
To the authors’ knowledge, the results presented in this paper are the first to introduce MARL for the control of UFH systems. Prior papers on MARL in other parts of HVAC systems do exist, though primarily for commercial buildings. In [10], an air handling unit controlled by a MARL algorithm formulated as a Markov Game in a cooperative setting is presented. This formulation means that each agent is aware of the state of the other agents, but not of the current action taken by the other agents. The algorithm is based on an actor-critic model, and a 5.6% energy saving is achieved. Looking slightly beyond HVAC systems, MARL is used for hot water production in domestic houses in [18]. Here, MARL is used in a distributed setting where agents are not aware of other agents’ actions.
The paper is organized as follows. First, Section 2 presents the background and contributions of the work: the general SARL and MARL theory is explained, together with the methods that inspired the present work, before elaborating on the contributions to the research field. Then, Section 3 presents the system design and an evaluation of the designed Dymola multi-physics simulation, which serves as the training and test environment for the designed SARL and MARL algorithms. Next, Section 4 presents the underlying math and design of the RL systems, along with hyperparameters, input states, and reward functions. Finally, Section 5 presents the experiments and the analysis of experimental results, and Section 7 concludes the paper.
2. Background and Contributions
This section introduces the theory behind SARL and MARL and argues why MARL is relevant for the control of UFH. Furthermore, the contributions of this paper are explained.
2.1. Reinforcement Learning
RL can be divided into three categories: value-based learning, policy-based learning, and actor-critic-based learning, where actor-critic is a combination of value-based and policy-based learning. In this paper, we focus on value-based learning, specifically Q-learning as a backbone technology, but the benefits of using MARL transfer across all three types [19]. The central idea of value-based learning is to find the optimal action-value function, which needs to satisfy the Bellman optimality equation (Equation (1)) [8]. Let $Q^*(s,a)$ be the optimal action-value function; then $Q^*(s,a)$ is given as follows:

$$Q^*(s,a) = \mathbb{E}\left[ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1},a') \mid s_t = s,\; a_t = a \right] \quad (1)$$

The Bellman equation states that if the future state for all actions is known, then the optimal policy is to choose the next action $a'$ that results in the highest Q-value ($\max_{a'} Q^*(s',a')$). This approach for choosing $a$ is referred to as a greedy policy.

An RL algorithm learns about the environment it interacts with by iteratively updating its estimate of the action-value function, such that the estimate $Q(s,a)$ converges towards $Q^*(s,a)$ as the number of iterations goes to infinity.

Choosing the correct action under a greedy policy is simple. Updating the Q-function so that $Q(s,a)$ converges towards $Q^*(s,a)$ is, however, difficult, at least within a reasonable number of iterations. The update strategy for Q-learning without so-called function approximators is given by Equation (2):

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t) \right] \quad (2)$$
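As a minimal illustration of Equation (2), the following tabular sketch performs one Q-learning update; the state and action counts and the hyperparameter values are illustrative assumptions, not the ones used later in this paper.

```python
import numpy as np

n_states, n_actions = 10, 4      # illustrative sizes
alpha, gamma = 0.1, 0.99         # learning rate and discount factor (assumed)
Q = np.zeros((n_states, n_actions))

def q_update(s: int, a: int, r: float, s_next: int) -> None:
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

q_update(s=0, a=1, r=1.0, s_next=2)   # example transition
```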
For the update strategy to converge to $Q^*$, it is necessary for the environment to satisfy the conditions of a Markov Decision Process (MDP).

Definition 1. An MDP is defined by a tuple $(S, A, P, R)$; $S$ is the finite set of states, $A$ is a finite set of actions, and $P$ is the transition probability for $s$ to transition to $s'$ under a given action $a$. $R$ is the immediate reward for the expected transition from $s$ to $s'$.

Equation (2) shows that for the Q-function to converge to $Q^*$, it is necessary to visit all state-action pairs. This is simply not feasible in large systems. Therefore, function approximation is used to approximate the Q-function $Q(s,a)$. By using an Artificial Neural Network (ANN) as a function approximator, it has been shown that a Q-function can converge [20]. Additionally, for function approximation with ANNs, a range of methods has been developed to reduce convergence time or improve convergence in value-based RL, for example double Q-learning [21], experience replay, prioritized experience replay [22], and several other methods.
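As an illustration of one of these methods, the sketch below computes the double Q-learning target [21] with an online and a target network; the network objects are assumed to be PyTorch modules mapping a batch of states to a vector of Q-values per action.

```python
import torch

def double_q_target(r: torch.Tensor, s_next: torch.Tensor, done: torch.Tensor,
                    online_net: torch.nn.Module, target_net: torch.nn.Module,
                    gamma: float = 0.99) -> torch.Tensor:
    """Double Q-learning target: the online network selects the greedy action,
    the target network evaluates it, which reduces overestimation bias."""
    with torch.no_grad():
        a_star = online_net(s_next).argmax(dim=1, keepdim=True)   # select
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)  # evaluate
    return r + gamma * (1.0 - done) * q_eval
```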
Even though the above-mentioned methods do improve generalization and reduce training time, RL, and machine learning in general, suffer from what Richard E. Bellman refers to as “the curse of dimensionality” [23]. For this reason, MARL is explored for RL control of HVAC systems.
2.2. Multi-Agent Reinforcement Learning
MARL is, like SARL, concerned with solving decision-making problems. However, instead of one agent deciding all actions in a system based on one reward, multiple agents decide the actions and receive individual rewards or a joint reward, depending on the type of MARL setting.
This paper focuses on MARL systems formulated as an MG. The formal definition of an MG is as follows:
Definition 2. A Markov Game is defined as a discrete-time stochastic process, a tuple $(N, S, \{A^i\}_{i=1}^{N}, \{R^i\}_{i=1}^{N}, P)$, where $N$ is the number of agents, $S$ is the state space observed by all agents, and $A = A^1 \times \cdots \times A^N$ is the joint action space of all $N$ agents. $R^i$ is the immediate reward received by agent $i$ for transitioning from $s$ to $s'$, and $P$ is the transition probability [24].

The definition of an MG can be interpreted as follows: at time $t$, every $i$'th agent determines an action $a^i$ according to the current state $s$. The system changes to state $s'$ with probability $P$, and each agent receives an individual reward based on the new state $s'$. When going from SARL to MARL, an entirely new dimension is added to the problem. It is, therefore, necessary to define what type of MG the system is; the type describes how the agents are formulated and affect each other. A MARL problem can be formulated in three ways: (1) a cooperative setting, (2) a competitive setting, and (3) a mixed setting [25].
Cooperative, Competitive, and Mixed Setting
In general, a MARL algorithm in a cooperative setting is formulated with a common reward function, so $R^1 = R^2 = \cdots = R^N$, and is referred to as a Markov team game. A number of different algorithms exist for solving a Markov team game, including team-Q [26] and Distributed-Q [27]. Distributed-Q is a MARL algorithm framework that is proven to work for deterministic environments. It has been shown that all agents will converge to an optimal policy without sharing data between agents, which is very appealing. However, because this article is concerned with a general sum problem, the approach will not work.
A MARL algorithm in a competitive setting is formulated as a zero-sum game, where one agent's win is the other agents' loss. Such a setting can be formulated as $\sum_{i=1}^{N} R^i = 0$. There exists a number of zero-sum game algorithms, see for example minimax-Q [28]. This MARL setup is, however, of little interest for this paper, as rewards for a UFH system cannot be formulated as a zero-sum game.
Finally, in the mixed setting each agent has its own reward function, and therefore its own objective. This mixed setting is also referred to as a general sum game.
Game theory and the Nash equilibrium play an essential role in the analysis of these systems. In contrast to a cooperative setting, where it is assumed that the system's overall best reward can be found by having all agents maximize their own reward, this is not possible in a general sum setting. A number of different algorithms for general sum games have been designed for static tasks [29]. However, UFH is a dynamic system, meaning that algorithms for static tasks must be adapted to dynamic systems to be relevant for our work.
In addition, single-agent algorithms have been used in a mixed setting [30,31], even though there is no guarantee of convergence when applying SARL to a multi-agent system [24]. Algorithms designed for dynamic tasks include Nash-Q [32] and PD-WoLF [33]; these form the starting point for our work.
The idea of a Q-learning algorithm that finds a Nash equilibrium is compelling. Subsequent papers have, however, argued that the application of Nash-Q is limited to environments that have a unique Nash equilibrium for each iteration [25]. More recent work on fully decentralized MARL has been proven to converge under the assumption of using linear function approximators for the value function [34]. Even though this algorithm is distributed, a joint Q-function is incorporated, which makes all agents aware of each other. This is necessary to prove general convergence, but it also increases complexity as the number of agents grows and therefore makes the approach less scalable. In more recent work, agents are distributed in a manner similar to ours [35]. The main difference between their architecture and our framework is that our agents only observe parts of the state space. Moreover, different methods are utilized for updating the Q-function.
2.3. Contributions
This paper extends the current state-of-the-art for model-free control of UFH systems, including testing SARL on UFH systems and presenting a novel MARL approach to HVAC systems. The novelty lies in the interaction between agents in the MARL algorithm: in Distributed-Q and Nash-Q, agents are either not aware or completely aware of each other, whereas in this paper, each agent acts according to a well-defined structure as described in Section 4. Furthermore, the comparison between the SARL simulation and the MARL simulation validates the hypothesis that MARL can reduce training times in HVAC systems. Lastly, we present a novel method to ensure robustness when controlling the supply temperature in HVAC systems.
3. System Design and Evaluation
To test hyperparameters, input states, and algorithms, it is necessary to have a simulation environment. A simulation environment is never a 1:1 representation of the real world, but to be applicable in this study, the simulation must, to a large extent, exhibit the same dynamic behavior as the real system. To accomplish this, Dymola has been used as the simulation tool.
3.1. Dymola Multi-Physics Environment
Dymola is a Modelica-based multi-physics simulation software and, as such, is suitable for simulating complex systems and processes. Several libraries have been developed for Dymola. For the simulations in this paper, the standard Modelica library and the Modelica Buildings library are used.
The simulation can be split into two parts: (1) The hydraulic part and (2) the thermodynamic part. The hydraulic part of the simulation can be described by a mixing loop, a pump, one valve per temperature zone, and some length of pipe per temperature zone.
The thermodynamic side of the simulation is constructed using the base element “ReducedOrder.RC.TwoElements”. This element includes heat transfer from exterior walls, windows, and interior walls to the room. It furthermore includes radiation from the outside temperature and radiation from the sun. This means that wind and rain do not affect the simulation, as they are assessed to be smaller disturbances. These disturbances are not negligible, but for the purpose of this paper, the simulation results will still indicate the saving potential that can be expected in a real-life installation. The element is made in accordance with “VDI 6007 Part 1”, which is the European standard for calculating the transient thermal response of rooms and buildings [36].
The length of pipe used in each zone and the parameters for the windows, walls, zone area, and volume are shown in Table 2.
To simulate how the room receives heat from the floor, a floor element has been constructed. The floor element incorporates the pipe length, pipe diameter, floor thickness, floor area, and construction material. These parameters enable the floor element to simulate how the heat from the water running in the pipes transfers through the concrete and into the room. The heat in the room is assumed to be uniformly distributed, meaning that the temperature at the floor, at the walls, and at the ceiling of the room is the same. Modeling the temperature distribution as uniform is also in accordance with “VDI 6007 Part 1”.
3.2. Evaluation of Simulation
It is not possible to validate the simulation environment with data from a real-world system. We can, however, examine step responses to assess the dynamic convection of heat from the water in the pipes to the air in the room. Additionally, we can evaluate the amount of power the rooms require and compare it to a real-world house. Lastly, the daily and seasonal power consumption can be evaluated.
To evaluate the simulation environment, a run of the simulation with hysteresis control on the valves and an outdoor-compensated supply temperature is executed. Note that hysteresis control is the control method traditionally used in UFH systems. For the validation, a one-temperature-zone system is used; a similar simulation has been made for a four-temperature-zone system with similar results.
All simulations are made with a hysteresis controller with a reference point of 22 °C and a dead band of ±0.1 °C. The outdoor-compensated supply temperature follows a linear model, see Equation (3).
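A minimal sketch of such a linear compensation curve is given below; the slope and offset are illustrative assumptions, as the exact coefficients are those of Equation (3).

```python
def supply_setpoint(t_ambient: float, slope: float = -1.0,
                    offset: float = 35.0) -> float:
    """Outdoor-compensated supply temperature: a linear function of the
    ambient temperature (coefficients here are assumed, not Equation (3)'s)."""
    return offset + slope * t_ambient

print(supply_setpoint(-10.0))   # colder outside -> warmer supply water
```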
Firstly, the room temperature over an entire heating season is plotted in Figure 2.
From Figure 2, it can be seen that a heating season is approximately 280 days. The heating season is defined here as the period of the year where the building needs energy to sustain a zone temperature of 22 °C. The simulation starts 1 March, and the period from 1 June to 1 September has been removed from the weather file, as no heat is needed in this period. The seasonal effect can be seen in the figure, where occasional overshoots occur in the period from day 70 to day 140, i.e., in the fall/spring period, where heat is needed during the night and morning but not during the daytime. The temperature otherwise oscillates between 21.7 °C and 22.2 °C.
To investigate the response more closely, the room temperature and the associated valve position are plotted over a period of 2 h and 30 min, see Figure 3. Figure 3 shows that when a valve is opened, 1700 s (0.02 days) pass before the temperature gradient in the room becomes positive. Additionally, it can be seen that when the valve is closed, the temperature continues to rise by an additional 0.1 °C, and about another 1700 s pass before the gradient becomes negative. This behavior is due to the slow dynamic properties expected of a UFH system, and it therefore validates that this simulation resembles the typical dynamics of a UFH system.
The price of heating over one heating season is plotted in Figure 5. Before reviewing the plot, it is necessary to explain how this price is calculated. The price will also serve as a benchmark to prove that significant cost savings are possible by utilizing reinforcement learning in UFH. To that end, in this article, it is assumed that the heat supply is an air-to-water heat pump.
The price of heating with a heat pump can be simulated by knowing the cost of electricity, the dynamics of a heat pump, and the power consumption of the system. The cost of electricity is assumed to fluctuate during the day; the average Danish price of electricity during a day can be seen in Figure 4a [37]. The dynamics of a heat pump can be described with the Coefficient of Performance (COP), which is a function of the ambient temperature and the supply temperature. This COP can be seen in Figure 4b [38]. Additionally, it is necessary to describe the Part Load Factor (PLF), which describes how the efficiency of the heat pump depends on the duty cycle. This PLF is shown in Figure 4c [39].
With the Cost of Electricity (CE), the COP, and the PLF described, and the power consumption of the system ($P$) available from the simulation, the cost of heating with a heat pump can be simulated with Equation (4):

$$\text{Cost} = \sum_{t} \frac{CE(t)\, P(t)}{COP(t)\, PLF(t)} \quad (4)$$
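A sketch of this computation over time-series data is shown below; the array shapes, units, and the elementwise form are assumptions for illustration.

```python
import numpy as np

def heating_cost(ce: np.ndarray, p: np.ndarray, cop: np.ndarray,
                 plf: np.ndarray, dt_hours: float) -> float:
    """Heat-pump heating cost in the spirit of Equation (4):
    electrical power = thermal power / (COP * PLF), weighted by the
    electricity price per sample. ce [Euro/kWh], p [kW], cop/plf [-]."""
    electrical_power = p / (cop * plf)
    return float(np.sum(ce * electrical_power) * dt_hours)
```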
From Figure 5, it is evident that the price of heating varies over the heating season. The lowest cost is found in the spring/autumn period and the highest during winter. Though there is a yearly trend, it is also evident that the price during a 14-day period can vary by 30%, as seen from day 200 to 210.
Finally, the power consumption is reviewed. The temperature zone is 30 m² and consumes 3561 kWh over a heating season at a reference temperature of 22 °C, i.e., an average of about 118 kWh per m². An average Danish house uses 115 to 130 kWh per m² [40], which shows that the simulation is within what is considered average in a Danish climate.
The use case is now described, the simulation has been evaluated, and it has been shown that it resembles a traditional Danish house. The RL algorithm can therefore be designed and tested on this simulation.
4. RL Algorithm Design
Figure 6 illustrates the hydraulic system for a four-zone UFH system. The system consists of a heat pump delivering a supply temperature and four on/off valves controlling the temperature of each of the four zones. The MARL algorithm is designed as an MG in a general sum setting, as explained in Section 2.2. Reviewing Section 3, the natural way of dividing the UFH environment into multiple agents is to have one agent control the supply temperature and one agent for each of the temperature zones. Each of the temperature zone agents controls the on/off valve supplying its zone with hot water. This setup means that for a four-temperature-zone UFH system, there are five agents, as illustrated in Figure 6.
By splitting the environment into five agents instead of one, the action space is changed. The result of this change is shown in Table 3. From the table, it is seen that the actions can be formulated either as one action with 240 discrete values or as 5 separate actions with 15 (supply temperature) or 2 (valve) discrete values each. With this formulation of the action space, the joint action is the vector $A = (a^{mix}, a^{v_1}, \ldots, a^{v_4})$. A distributed RL formulation of the problem can be written as in Equation (5):

$$Q^i(s, a^1, \ldots, a^N), \quad i = 1, \ldots, N \quad (5)$$
From Equation (5), it can be seen that all agents have a Q-function and have full observability of the actions of all other agents. However, when the UFH system is investigated, it can be argued that the valve agents have little to no effect on each other. For this reason, the connections between the valve agents can be considered unnecessary and removed. Removing these connections gives the following formulation of the Q-functions, see Equations (6) and (7), where $v_1$ to $v_m$ refer to valve 1 to valve $m$ and $mix$ refers to the supply temperature:

$$Q^{v_j}(s, a^{v_j}, a^{mix}), \quad j = 1, \ldots, m \quad (6)$$

$$Q^{mix}(s, a^{mix}, a^{v_1}, \ldots, a^{v_m}) \quad (7)$$
Since it is argued that the zones have little to no effect on each other, it can also be argued that parts of the state space should only be locally observed. For this reason, the local state spaces $s^{v_j}$ and $s^{mix}$ are defined, and the Q-functions can be rewritten as Equations (8) and (9):

$$Q^{v_j}(s^{v_j}, a^{v_j}, a^{mix}), \quad j = 1, \ldots, m \quad (8)$$

$$Q^{mix}(s^{mix}, a^{mix}, a^{v_1}, \ldots, a^{v_m}) \quad (9)$$

Elaboration on which states are relevant for which agents is given in Section 4.3.
With the Q-functions formulated, the foundation for the MARL algorithm is established. An illustration of how the agents interact with the environment can be seen in Figure 7.
The following explains (1) which RL methods are used, (2) the pseudo-code for the algorithm, (3) how the reward functions are formulated, (4) how the state-action space is distributed, and (5) which hyper-parameters are used.
4.1. Reinforcement Learning Methods
The backbone RL algorithm used in this paper is the deep double Q-network algorithm with experience replay, N-step learning, and epsilon-greedy exploration with epsilon decay.
The N-step eligibility trace used in the algorithm is also used in [41,42]. This approach is chosen due to the slow dynamics of the UFH system, which make eligibility traces desirable. Experience replay is also used to enhance data efficiency. The implementation of experience replay is customized to the N-step eligibility trace and to MARL. This is done by maintaining the experience in mini-batches with the same length as the N-step eligibility trace. Additionally, the experience replay is synchronized between the agents, so that the experience for agent $i$ at time $t$ has the same timestamp as the experience for agent $j$ at time $t$. The pseudo-code for the MARL algorithm can be seen in Algorithm 1.
Algorithm 1 MARL Deep Q-Learning.

1: for each iteration k do
2:   for each environment step do
3:     Observe state and distribute local states to respective agents.
4:     Valve agents select greedy action or a random action with probability epsilon.
5:     Mixing agent selects greedy action or a random action with probability epsilon.
6:     Collect actions into A, execute, and observe next state and reward.
7:     Store (s, A, r, s′) in the replay buffers.
8:     Decay epsilon.
9:   end for
10:  for each update step do
11:    Each agent samples experience of size B, each with length n-step, from its replay buffer.
12:    Compute Q-values as described in Equations (7) and (9).
13:    Calculate losses for all agents.
14:    Calculate gradients with respect to the network weights and perform a gradient step.
15:    Every C environment steps, update the target networks.
16:  end for
17: end for
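A minimal sketch of the synchronized N-step replay described above is given below; the class structure and names are assumptions, and only the alignment of timestamps across agents is the point being illustrated.

```python
import random
from collections import deque

class SyncNStepReplay:
    """Per-agent replay buffers that store n-step mini-batches and are
    sampled with shared indices, keeping agents' experiences time-aligned."""
    def __init__(self, n_agents: int, n_step: int, capacity: int = 10_000):
        self.n_step = n_step
        self.buffers = [deque(maxlen=capacity) for _ in range(n_agents)]
        self.partial = [[] for _ in range(n_agents)]

    def push(self, transitions):
        """transitions: one (s, a, r, s_next) tuple per agent, same timestep."""
        for i, tr in enumerate(transitions):
            self.partial[i].append(tr)
            if len(self.partial[i]) == self.n_step:
                self.buffers[i].append(tuple(self.partial[i]))
                self.partial[i] = []

    def sample(self, batch_size: int):
        """Draw the same indices for every agent so timestamps stay aligned."""
        idx = random.sample(range(len(self.buffers[0])), batch_size)
        return [[buf[j] for j in idx] for buf in self.buffers]
```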
4.2. Reward Functions
To gain intuition about the reward functions, two base functions are defined—one for the valve system and one for the supply temperature.
4.2.1. Valve Reward
The valve reward is shown in Equation (10). The reward function depends on two sub-functions shown in Equations (11) and (12).
The abbreviations in the equations above are the following: $R$ = reward, $SC$ = safety controller, $T_z$ = zone temperature, $V$ = valve position, and $HC$ = hard constraint.
The two sub-functions, Equations (11) and (12), are part of a safety framework for ensuring robust behavior in RL [43]. In [43], it is demonstrated that by implementing a safety controller, in this case a tolerance controller, on top of the RL algorithm, robust behavior can be ensured. The safety controller is activated when the RL controller tries to explore parts of the action-state space that are known to be outside the safety boundaries. The $HC$ variable is the soft constraint that increases linearly as the agent continues to try to explore parts of the action-state space known to have negative comfort characteristics. The immediate reward, which is received when the zone temperature is less than 0.4 °C below the reference point, enforces that the comfort is highest when the temperature is in this range. When the reward function is used in the MARL setting, the reward is simply distributed so that each agent receives the reward for the zone it is controlling. When the rewards are used in the SARL setting, all rewards are summed into one reward.
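An illustrative sketch of this reward logic is given below; Equations (10)-(12) are not reproduced here, so the thresholds and reward magnitudes are assumptions chosen only to show the structure.

```python
def valve_reward(t_zone: float, t_ref: float,
                 sc_active: bool, hc: float) -> float:
    """Sketch of the valve reward: a comfort reward near the reference and a
    growing soft-constraint penalty when the safety controller intervenes."""
    reward = 0.0
    if t_zone >= t_ref - 0.4:      # within 0.4 C below the reference point
        reward += 1.0              # immediate comfort reward (assumed value)
    if sc_active:                  # safety controller overrode the agent
        reward -= hc               # hc grows linearly with repeated violations
    return reward
```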
4.2.2. Supply Temperature Reward
The reward function for the supply temperature is shown in Equation (13), and the associated sub-functions are shown in Equations (14) and (15).
The abbreviations in the equations above are the following: $R$ = reward, $SC$ = safety controller, $T_z$ = zone temperature, $V$ = valve position, $HC$ = hard constraint, and $P$ = price.
From Equation (13), it is seen that the reward is much like the reward from the valve agent, with the difference that the reward requires that the valve is open. When heating with a heat pump, it is optimal to have as much water circulation as possible. Adding that the reward is highest when the valve is open enforces this behavior. The price is a scalar between 0 and 1, and the lower the price of heating, the better.
As in the valve reward function (Equation (10)), a safety controller is put on top of the RL algorithm. Simulation results show that it has some of the same effects as in the case of the valve agent reward. The safety controller is the outdoor-compensated supply temperature controller used in the simulations in Section 3.2. It is activated whenever the temperature in a given zone is 1.5 °C lower than the reference temperature and the associated valve is open.
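A companion sketch for the supply-temperature reward is shown below; as before, Equations (13)-(15) are not reproduced, so the valve-open condition and the price term follow the text while the numeric values are assumptions.

```python
def supply_reward(t_zone: float, t_ref: float, valve_open: bool,
                  price: float, sc_active: bool, hc: float) -> float:
    """Sketch of the supply-temperature reward: comfort only counts while the
    valve is open, and a lower (normalized) heating price gives more reward."""
    reward = 0.0
    if valve_open and t_zone >= t_ref - 0.4:
        reward += 1.0 - price      # price is a scalar in [0, 1]; cheaper is better
    if sc_active:                  # outdoor-compensated fallback engaged
        reward -= hc
    return reward
```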
4.3. Input States
The input states to the RL agent are a mix of actual states of the system and parameters that are functions of the system’s states. The input states are divided into valve input states and supply input states. The input state space is explained from a MARL point of view. From a SARL point of view, the same states are used. These are combined in a single tuple and sent to the single agent.
4.3.1. Valve Input States
Seven states are used in the valve agents. All states are normalized so they assume a value from 0 to 3. These states can be seen in the list below.
Valve agent input:
- Room Temp, [°C];
- ΔRoom Temp, [°C];
- Hard constraint Valve;
- Supply temperature, [°C];
- Ambient temperature, [°C];
- Sun, [W/m²];
- Time of day, [hours and minutes].
From the list above, seven input states can be seen. The “ΔRoom Temp” is the gradient of the room temperature, and the “Sun” is the strength of the sun. In the weather file, this strength is measured in [W/m²].
4.3.2. Supply Temperature Input States
The input states for the supply temperature can be seen in the list below.
Supply agent input:
- Room Temp, [°C];
- ΔRoom Temp, [°C];
- Hard constraint Supply;
- Supply Temperature, [°C];
- Ambient Temperature, [°C];
- Sun, [W/m²];
- Time of day, [hours and minutes];
- Price, [Euro].
From the above, it can be seen that many of the states are the same as in the valve agent; only the price and the hard constraint states differ. An overview of how the number of states increases with the number of temperature zones is given in Table 4.
4.3.3. Action Space
As explained in Section 3, the action space for the SARL formulation of a four-zone UFH system is a discrete value from 0 to 239, and the action space for the MARL system is the vector $A = (a^{mix}, a^{v_1}, a^{v_2}, a^{v_3}, a^{v_4})$, see Table 3.
Tests in the simulation have shown that when running simulations with SARL, it is necessary to reduce the action space, so that there are 31 actions instead of 240 for a four-zone system. This reduction is achieved by separating the control of the valves from the control of the supply temperature: the agent can either control the valves or the supply temperature at a given step. This gives 16 actions for the valves and 15 for the supply temperature, hence 31 actions in total. The same reduction is applied in SARL for the one-zone system, resulting in 16 actions, that is, 15 mixing actions and one action for closing the valve. The action spaces for the one-zone SARL and MARL algorithms are therefore similar, and the convergence times are expected to be similar as well. An overview of the action space in the different settings can be seen in Table 5.
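The reduced SARL action space can be decoded as in the sketch below; the encoding (valve bitmask first, supply levels after) is an assumption used only to illustrate how 16 + 15 = 31 actions cover a four-zone system.

```python
def decode_sarl_action(a: int, n_zones: int = 4) -> dict:
    """Decode one discrete SARL action in [0, 30]: actions 0-15 set the valve
    combination, actions 16-30 select one of 15 supply-temperature levels."""
    if a < 2 ** n_zones:                          # valve action: bitmask over zones
        valves = [(a >> z) & 1 for z in range(n_zones)]
        return {"valves": valves, "supply_level": None}
    return {"valves": None, "supply_level": a - 2 ** n_zones}   # levels 0-14
```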
4.3.4. Hyperparameters
The hyperparameters used for training and testing the algorithms are displayed in Table 6. The values were found from empirical tests of the algorithms.
The following section presents a test plan explaining which experiments are required to demonstrate the scalability of MARL and its increased performance compared to a traditional controller. Furthermore, the results of the experiments are presented and analyzed. All raw data obtained from these simulations can be found at https://github.com/ChrBlad/MARL_data (accessed on 30 March 2021).