Article

FS-DDPG: Optimal Control of a Fan Coil Unit System Based on Safe Reinforcement Learning

1 School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
2 Jiangsu Province Engineering Research Center of Construction Carbon Neutral Technology, Suzhou University of Science and Technology, Suzhou 215009, China
3 Jiangsu Province Key Laboratory of Intelligent Energy Efficiency, Suzhou University of Science and Technology, Suzhou 215009, China
4 School of Architecture and Urban Planning, Suzhou University of Science and Technology, Suzhou 215009, China
* Authors to whom correspondence should be addressed.
Buildings 2025, 15(2), 226; https://doi.org/10.3390/buildings15020226
Submission received: 4 November 2024 / Revised: 31 December 2024 / Accepted: 9 January 2025 / Published: 14 January 2025
(This article belongs to the Section Building Energy, Physics, Environment, and Systems)

Abstract

To optimize the control of fan coil unit (FCU) systems under model-free conditions, researchers have integrated reinforcement learning (RL) into the control of system pumps and fans. However, traditional RL methods can cause significant fluctuations in pump and fan flow, posing a safety risk. To address this issue, we propose a novel FCU control method, Fluctuation Suppression–Deep Deterministic Policy Gradient (FS-DDPG). The key innovation lies in modeling the FCU control problem as a constrained Markov decision process, where a penalty term for process constraints is incorporated into the reward function and constraint tightening is introduced to limit the action space. In addition, to validate the performance of the proposed method, we established an FCU simulation platform with variable operating conditions based on the parameters of an actual FCU system and ten years of historical weather data. The platform's correctness and effectiveness were verified from three aspects: heat transfer, the air side and the water side, under different dry and wet operating conditions. The experimental results show that, compared with DDPG, FS-DDPG avoids 98.20% of pump flow fluctuations and 95.82% of fan flow fluctuations, ensuring the safety of the equipment. Compared with DDPG and RBC, FS-DDPG achieves energy saving rates of 11.90% and 51.76%, respectively, and also performs better in terms of operational performance and satisfaction. In the future, we will further improve the method's scalability and apply it to more complex FCU systems in variable environments.

1. Introduction

In the global context, the building sector accounts for over 35% of the world’s energy consumption and nearly 40% of energy-related carbon dioxide emissions [1]. According to projections by the International Energy Agency (IEA) [2], the peak power load in countries with low HVAC penetration is expected to rise by approximately 45% by 2050. This tremendous growth has drawn attention to the management of HVAC systems and energy efficiency. Therefore, it is crucial to improve the efficiency of HVAC systems to reduce energy consumption in the building sector.
Previous studies have mainly focused on air-conditioning heating and cooling sources, with little attention given to terminal air-conditioning equipment, especially fan coil units. The FCU air-conditioning system is the most widely used HVAC subsystem in public facilities [3]. Since FCUs are used in a wide range of scenarios, operate for long periods and consume considerable energy, studying more energy-efficient control methods is of great significance [4]. However, the control parameters of FCU systems involve both linear and nonlinear relationships, and the cooling load fluctuates constantly with the building environment, giving rise to different operating conditions of the FCU. Therefore, optimizing the control of FCU systems remains an urgent problem.
At present, common control methods for FCUs focus merely on indoor temperature as the control target; they include model-based control (MBC) methods [5]; model-free control (MFC) methods [6]; rule-based control (RBC); proportional, integral, derivative (PID) control; and so on. Although simple models lend themselves to dedicated optimization, their limited accuracy degrades control performance. To address these issues, researchers tend to adopt MFC methods [7]. Among MFC methods, RBC methods that rely on expertise are the most commonly used, such as using PID in FCU control to track set points [8]. However, because their control parameters are fixed, these methods are not well suited to specific FCU systems [9]. Owing to their ease of deployment, such control methods are widely used in practical projects; yet existing rule-based FCU control methods change little during operation, leading to higher energy consumption.
With the continuous development of artificial intelligence and machine learning (ML), these methods are being further studied and applied in the FCU field [10]. Compared to MBC methods, ML performs better in dynamic environments while learning from real-time data. Compared with conventional MFC methods, ML is capable of accommodating dynamic conditions and capturing complicated system relationships [11]. RL is an ML method that does not rely on physical models: the agent seeks the optimal strategy through interaction with the environment, taking actions and adjusting its strategy based on rewards from environmental feedback to continuously improve its performance [12]. RL is especially suited to complex environments that require long-range planning. Researchers have already tried to apply RL in the FCU field. With its ability to search for optimal strategies in large-scale state spaces, RL enables FCU systems to find the best control strategies [13].
However, when traditional RL methods are used to control FCU systems in complex and demanding real-world scenarios, collecting data through trial and error is not only costly but also leads to drastic changes in air and water flow rates within a short period, posing safety risks, as illustrated in Figure 1. The figure depicts part of the fluctuations in water and air flow rates, with the abscissa representing the training episodes and the ordinate showing water and air flow rates. The blue curve represents the changes in water flow, with red points indicating control points where the water flow changes abruptly. It is evident that traditional RL control of pump flow often produces volatile control points (e.g., from 80 to 700 kg/h, and from 700 to 80 kg/h). The orange curve represents the changes in air flow, with red points marking control points where the air flow changes abruptly. Similarly, traditional RL control of fan flow frequently produces volatile control points (e.g., from 100 to 1200 m³/h, and from 1200 to 100 m³/h). When the air flow rate of the fan fluctuates sharply, the air flow becomes unstable and comfort is reduced; moreover, motors, bearings and other components wear excessively, shortening the life of the equipment. When the water flow of the pump fluctuates dramatically, pressure fluctuations may arise in the pipeline, resulting in pipeline damage; in addition, the pump body may be overloaded, shortening its service life. Therefore, RL methods need to take safety and stability into account and integrate them with traditional control strategies to ensure the long-term, stable operation of FCU systems.
To address the aforementioned challenges, this paper proposes an improved FCU control strategy based on safe reinforcement learning (SRL): the FS-DDPG control method. The sequential decision problem is modeled as a constrained Markov decision process. The method limits the action space perceived by the agent during decision-making by adjusting the constraint tightening in real time, focusing not only on maximizing long-term rewards but also on following specific safety constraints during the decision-making process. It aims to optimize the regulation of water and air flow within the FCU system, mitigating the impact of action fluctuations on equipment, reducing operational stress, lowering energy consumption and ultimately improving occupant comfort.
The main contributions are as follows:
  • Based on historical weather data, an office room model was established using DeST to calculate the required cooling load for the corresponding period. A heat and mass transfer model of an FCU (incorporating both the theory of thermal and moisture exchange and the calculation of parameters within the model) was constructed in Python, and its feasibility was validated.
  • To address the issue of significant system fluctuations caused by the use of RL methods in controlling the FCU process, the FCU control process is modeled as a constrained Markov decision process. A penalty term regarding process constraints is added to the actual reward function, and constraint tightening is introduced to limit the action space perceived by the agent.
  • We propose the FS-DDPG algorithm to optimize the control strategy of an FCU based on reinforcement learning. Compared with traditional RL algorithms, FS-DDPG not only seeks the maximum reward but also suppresses violent action fluctuations in the control process, greatly reducing the risk of equipment damage caused by large-scale regulation of the unit.
  • The experimental results indicate that this method demonstrates high generalizability and stability, exhibiting significant adaptability in dynamic environments. It not only meets the cooling load requirements within a room while reducing system energy consumption but also ensures the long-term stable operation of the system to the greatest extent. The code and FCU model have been published for future research on the referenced website (available online: https://github.com/leecy123123/fcu_1.git (accessed on 10 January 2025)).
Section 1 introduces the control methods and research status of fan coil units, and then the significance of fan coil control based on SRL. Section 2 reviews different control strategies and the application of RL in optimizing system stability and energy efficiency. Section 3 establishes the dynamic simulation model of a fan coil system based on an actual fan coil system. Section 4 introduces the Markov modeling and the control flow based on the FS-DDPG algorithm. Section 5 compares the experimental results with those of other control methods and analyzes the effectiveness of the proposed method in terms of safety and energy saving from various angles. Section 6 summarizes the research and discusses future developments of the control method.

2. Related Work

FCU systems have recently been applied extensively in numerous construction sectors. Their performance can be optimized through two control strategies: model-based control and model-free control. Model-based control strategies rely on pre-constructed physical models, whereas model-free control strategies manipulate FCU systems directly through real-time feedback, without depending on pre-built models [14,15,16].
In the early stages, simple rule-based control (RBC) methods were used to manage FCU systems, holding these systems in specific states, which often resulted in indoor temperatures being too high or too low. Through extensive practical application, the PID (proportional, integral, derivative) control algorithm has gained widespread recognition in the control field for its excellent stability and robustness. As classical control theory has advanced, various complex control strategies based on the traditional PID method have emerged. Li et al. [17] proposed using a fractional-order PID controller for indoor temperature together with a three-position controller for air supply volume to achieve the desired indoor temperature. Li et al. [18] configured this fractional-order PID control system by tuning the values of five parameters of the fractional-order PID controller for indoor temperature. The results show that the control performance indices of the proposed system meet the design requirements of comfortable air-conditioning. Despite the widespread application of PID control methods in air-conditioning, since the break frequency of PID was challenging to align with that of the system model, their effectiveness was limited in complex and nonlinear systems, resulting in suboptimal control performance.
In recent years, model-based control (MBC) has garnered significant attention in FCU applications. MBC, as a form of supervisory control, offers good stability and multi-objective rolling optimization. However, its effectiveness relies heavily on accurate pre-built models and requires precise data on indoor and outdoor architectural parameter changes [19]. If there is a substantial discrepancy between the mathematical model and the actual FCU system, the control effectiveness of MBC is difficult to guarantee. Within model-based control, model predictive control (MPC) is the most commonly used method. Zhao et al. [20] developed a distributed MPC architecture for fan coil systems, using a hybrid Beetle Antennae Search–Particle Swarm Optimization (BAS-PSO) algorithm to optimize the local MPC online. The experimental results show that, compared with traditional PID control, the distributed MPC achieves coordinated multi-zone indoor temperature regulation with an energy saving rate of 16.23%. Sanama [21] proposed a PID predictive model controller that can effectively improve indoor air quality and achieve a satisfactory control effect. Martin Chevitch [22] developed a control-oriented energy model, combining modeling and identification methods to identify FCU systems and thus achieve predictive building control.
Model-free control (MFC) methods are independent of the FCU model, boasting high adaptability and real-time response capabilities, enabling quick adaptation to environmental changes and uncertainties in the system. However, MFC methods rely heavily on data quality and require more computational resources. Compared to MBC methods, MFC might fall short on certain performance indicators, such as system energy consumption and operational performance, but it remains within an acceptable range and may be more suitable in certain application scenarios. In the realm of model-free control methods, Guillen et al. [23] built a RELAP5-3D/LSTM model to analyze FCU faults: the LSTM predicts the FCU inlet temperature according to PMIS, and RELAP5-3D simulates both the measured FCU inlet temperature and the LSTM-predicted temperature to obtain the FCU outlet temperature. Abnormal operation of an FCU can then be detected by comparing the actual FCU outlet temperature with the predicted value. Lin et al. [24] developed an optimization control strategy for Heating, Ventilation and Air-Conditioning (HVAC) systems, including temperature control for fan coil units (FCUs). Their strategy dynamically sets the FCU temperature according to changes in outdoor temperature; collects the set FCU temperatures and converts them into the cooling capacity required by the chilled water system; and uses genetic algorithms to calculate the minimum energy consumption of the HVAC system. However, machine learning relies heavily on data: insufficient data, poor-quality data or overfitting can severely impact system performance [25].
Currently, RL is regarded as an effective model-free learning method. Its applications in fields such as HVAC have already shown significant success, highlighting its enormous potential for control optimization and energy efficiency. However, the application of RL in the FCU field remains relatively limited. Applying RL algorithms to control FCU systems offers the promise of improving user comfort, optimizing indoor air quality and reducing energy consumption. Chen et al. [26] proposed adopting the DQN algorithm to adjust the air supply volume of an FCU in an office building; the method improves the satisfaction rate of joint control of indoor temperature and relative humidity. Zhang et al. [13] introduced a DDPG-PK algorithm to regulate the air and water flow of an FCU, aiming to achieve minimal energy consumption while satisfying the cooling load requirements. Unlike MFC methods, this algorithm integrates expert knowledge into the DRL, reducing initial exploration and improving initial performance. However, when controlling FCU systems with traditional RL algorithms, agents may take high-risk exploratory actions with significant fluctuations, which are unacceptable for practical FCU systems. Therefore, the concept of SRL is considered in the optimization of FCU system control. Unlike existing studies using Deep Reinforcement Learning (DRL) methods, SRL methods prioritize safety in the learning and decision-making process. This ensures that, while pursuing optimal performance, the system places greater emphasis on avoiding dangerous actions and minimizing potential damage to the system. By reducing risks and enhancing system robustness, SRL helps to establish a long-term sustainable FCU system. Hence, researching the use of SRL in FCU control is of significant importance. To address these issues, this paper proposes the FS-DDPG method to regulate the air and water flow rates of the FCU system. This method ensures the system's stability and safety while meeting the required cooling load with minimal energy consumption. Compared with conventional RL control methods, it innovatively incorporates policy constraints into the modeling process, thereby constraining actions in FCU systems and ensuring long-term automated operation.
Additionally, FS-DDPG exhibits strong adaptability and can be applied to various types of HVAC systems and similar control systems. However, its generalizability is influenced by factors such as environmental stability, data quality and system complexity. Therefore, in broader application scenarios, it may be necessary to adjust the assumptions and optimization strategies of the method according to the specific control system.

3. Case Study

This paper tries to optimize an FCU system to satisfy cooling demands while controlling energy consumption and ensuring the safety of the unit’s equipment. The effect of this method is measured with computer simulations. The FCU system operates from Monday to Friday, 7:00 a.m. to 9:00 p.m., depending on the required scenario.
Given the complexity of designing and simulating FCUs, developing a model with high computational accuracy, fast processing speed and ease of data analysis is of significant value and importance for long-term research on FCUs.

3.1. System Operation Overview

The FCU system is a terminal device of air-conditioning systems, consisting of a fan, a controller and coils (air heat exchangers). Chilled water provided by refrigeration equipment (such as chillers) enters the terminal coils under the pressure generated by the chilled water pump. To precisely control the water flow rate in each area, shut-off valves are installed at the front end of the fan coils. As the chilled water flows through the terminal fan coil, the temperature of the copper water pipes inside the coil is lowered through forced heat transfer within the pipes; in turn, this reduces the temperature of the aluminum fins mounted on the water pipes. When the fan operates, it blows the air between the aluminum fins into the room, enabling the FCU system to satisfy the indoor cooling load. The chilled water then returns to the chiller plant to complete the cycle. The temperature difference of the chilled water before and after the heat exchange process represents the cooling capacity of the FCU system. The overall working schematic is depicted in Figure 2.
In the control issues of the FCU, there is a correlation between different cooling loads and the control of water and air flow rates in the FCU system. Therefore, in this paper, the cooling load is used as the basis for controlling the FCU system. The water and air flow rates of the FCU system are treated as the control actions of the controller.

3.2. Building Room Model and Simulating Cooling Load

This paper utilizes the DeST software (version 20230713) to simulate the cooling load of rooms. DeST is a software platform for simulating building environments and HVAC systems, designed to help architects and engineers evaluate the energy performance of building designs in the early design stages. A room model of an office building was established using DeST. Ten years of weather data were input to calculate the room's cooling load, which is used for the subsequent optimization control of the FCU system by the RL algorithm. The cooling load here is only used as the state in the RL algorithm and is not involved in the training process, as detailed in later sections. The control process is shown in Figure 3.
In this paper, a single-story room model is established in DeST, configured as a floating structure. Eleven rooms and one corridor are set up, with each room equipped with windows. The top-down view of this room is illustrated in Figure 4.
In the room model, 11 rooms and one corridor are established (numbered 1-N-1, 1-N-2, …, 1-N-12). Table 1 shows the usage, area and set room temperature for each room. Notably, rooms 1-N-11 and 1-N-12, being a storeroom and a corridor, respectively, are not equipped with FCUs but are included in the heat exchange calculations with the other rooms in the DeST software. The maximum thermal disturbance power of the fixed equipment is set to its rated power of 12 W. The occupancy schedule follows DeST's basic settings for the corresponding room usage, without additional thermal disturbances or ventilation. The equipment schedule aligns with the room's usage.
After establishing a basic room model in DeST, the weather data from 2012 to 2022 (including 2 m air temperature, surface pressure, surface temperature, dew point temperature, relative humidity, east wind speed, north wind speed, total solar irradiance, net solar irradiance, amount of precipitation, evaporation capacity and ultraviolet intensity) were downloaded from the US National Ocean Service [27], with data from 2012 to 2021 used for the training set and data from 2022 used for testing. The metrics included in this data set are shown in Table 2.
The raw weather data from ClimateData.accdb were pre-processed using Microsoft Access to match the format required by the DeST simulation. The cooling load for the cooling season from 1 May to 30 September was then calculated by DeST based on the weather data. Additionally, ineffective cooling load data were eliminated, keeping only human activity hours, specifically from 7:00 a.m. to 9:00 p.m., or 14 h per day. This process yielded cooling load values for ten rooms over eleven years, excluding rooms 1-N-11 and 1-N-12. After processing, each room's cooling load data comprised 23,562 entries at a resolution of 1 h. Figure 5 shows heat maps of the 2142 cooling load data points for room 1-N-6 in 2021.
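As an illustration, this season and hour-of-day filter can be expressed with pandas as in the sketch below; the file and column names are our own assumptions, not part of the published pipeline.

import pandas as pd

# Load the hourly cooling load series exported from DeST; the file name and
# column names ("time", "cooling_load") are illustrative assumptions.
df = pd.read_csv("cooling_load_1N6.csv", parse_dates=["time"])

# Keep the cooling season (1 May to 30 September) and the occupancy window
# (7:00 a.m. to 9:00 p.m., i.e., the 14 hourly entries starting at 7:00).
in_season = df["time"].dt.month.between(5, 9)
in_hours = df["time"].dt.hour.between(7, 20)
df = df[in_season & in_hours].reset_index(drop=True)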

3.3. Modeling of Fan Coil

To validate the working principle of the control system algorithm, this section establishes an input–output simulation model for the FCU, as illustrated in Figure 6.
The control variables of the system are set as the water flow inside the coil and the air flow discharged by the fan, which are adjusted through the control of the pump and the fan, respectively. This allows for more precise and rapid adjustment of indoor temperature and humidity, while also more accurately matching system demands to optimize energy efficiency. These variables serve as the control points for the subsequent control algorithm analysis. The input variables are the inlet temperature of the chilled water, the dry bulb temperature of the incoming air and the relative humidity of the incoming air. To simplify the model, the outlet temperature of the chilled water is fixed at 7 degrees Celsius, and the dry bulb temperature and relative humidity of the incoming air are set to follow a normal distribution centered around the current outdoor dry bulb and wet bulb temperatures. The model’s output measures the actual cooling capacity of the FCU, along with the corresponding water flow and air flow, and the power consumption of both the pump and the fan.
The physical parameters of the FCU system used in this paper are shown in Appendix A. The model assumptions and detailed calculations of the FCU are shown in Appendix B. Appendix B implements the calculation process of the FCU system, but its validity needs to be verified by experiments. The actual experimental data of a particular model of the FCU system are compared with the calculated results of the model, and the results are shown in Appendix C.

4. Methodology and Control Process

To ensure the safe operation of the FCU system and to prevent significant fluctuations in the actions of the chilled water pump and the fan, a novel energy-saving control method is proposed. This method, named the FS-DDPG method, specifically addresses the severe action fluctuations that may arise in the DDPG control method. By optimizing the actions provided by the actor network in DDPG, FS-DDPG can stabilize the air flow and water flow rates of the fan and pump. Consequently, FS-DDPG not only reduces energy consumption but also protects the equipment.

4.1. CMDP Modeling

When DRL is used to solve control problems, the controlled problem must be modeled as a Markov decision process. However, the traditional MDP ignores uncertainty and uncontrollable factors in the environment, which leads to instability and performance degradation of the model. To solve these problems, the control process is modeled as a constrained Markov decision process (CMDP). In a CMDP, the agent needs to find a strategy that satisfies the constraints, choosing the optimal action in each state to obtain the maximum return while satisfying additional constraints. In this section, the control process of the FCU system is modeled as a CMDP, which mainly involves the setting of the state, action, cost and reward, as follows:
1. State
When modeling the Markov decision process, the properties of the controlled environment are often defined as states in SRL. In this paper, the cooling load currently demanded is the basis for controlling the water flow and air flow of the FCU system. Hence, the cooling load required at the current moment is taken as the state in SRL, as shown in Equation (1):
$s = CL_t$ (1)
2. Action
In the FCU control process, the control nodes of the system are usually defined as the air flow of the fan and the water flow of the pump. Thus, water flow and air flow are set as the action in SRL, as shown in Equation (2):
$a = (flow_{water}, flow_{air})$ (2)
3. Reward
In the control process, not only must the cooling output of the FCU system satisfy the cooling load demand of the room, but the air flow and water flow must also remain relatively stable. Therefore, when the cooling output generated by the system satisfies the required cooling load of the room and the flow rates of the fan and the pump change smoothly, the environment rewards the behavior under the current state. Moreover, the reward is negatively correlated with $\left| Q_{FCU} - CL_t \right|$, $diff$ and $P_{water} + P_{air}$, as shown in Equation (3):
$r = k_1 \dfrac{1}{e^{\frac{(Q_{FCU} - CL_t)^2}{2 \times (0.1 \times CL_t)^2}}} - (P_{water} + P_{air}) - \lambda \cdot diff$ (3)
where $Q_{FCU}$ is the actual generated cooling capacity, $P_{water}$ is the water-flow-related power, $P_{air}$ is the air-flow-related power, $k_1$ is a hyperparameter for meeting the current cooling load, $\lambda$ is the penalty term weight and $diff$ is an indicator of action continuity.
During optimization, there is a trade-off between energy consumption, safety performance and operational performance. For instance, reducing energy consumption might slow down the system response and decrease satisfaction. Therefore, two weight factors, $k_1$ and $\lambda$, are used in the reward function to balance energy efficiency, operational performance and safety performance. After ensuring that the values of each part are of the same order of magnitude, a sensitivity analysis experiment was performed to set $k_1$ to 330 and $\lambda$ to 10. These coefficients ensure that, while maintaining satisfaction and safety performance, the system can still minimize energy consumption and avoid the performance degradation caused by excessive energy optimization.
As shown in Figure 7, based on the aforementioned modeling, the agent first obtains the required cooling load and then selects water flow and air flow as actions. After acting on the FCU, the agent obtains a reward, and the state moves to the next required cooling load.
4. Constraint
In the CMDP framework, $C$ is considered a safety measure of system actions, balancing cumulative rewards against the safety of the system. We consider the reward function as the relationship between $C$ and $R$, and adjust the objective function of the CMDP, as shown in Equation (4):
$J_{CMDP}(\pi) = J(\pi) - \lambda \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} C(a_t) \mid s_0 \sim \rho_0, a_t \sim \pi(s_t) \right]$ (4)
where $\lambda$ is a hyperparameter used to balance reward and safety constraints. In this problem, $C(a)$ is regarded as the relative error of the output action and is defined in Equation (5); it consists of $C_1(a)$ and $C_2(a)$, which are defined in Equation (6):
$C(a) = w_1 C_1(a) + w_2 C_2(a)$ (5)
$C_1(a) = \dfrac{\left| a_{water} - a_{water\_pre} \right|}{a_{water}}, \qquad C_2(a) = \dfrac{\left| a_{air} - a_{air\_pre} \right|}{a_{air}}$ (6)
where $w_1$ and $w_2$ are weight parameters, $a_{water}$ is the water flow at time $t$, $a_{water\_pre}$ is the water flow at time $t-1$, $a_{air}$ is the air flow at time $t$ and $a_{air\_pre}$ is the air flow at time $t-1$. A Python sketch of Equations (3), (5) and (6) follows this list.
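To make the CMDP setting concrete, the minimal Python sketch below implements Equations (3), (5) and (6); the function and variable names are our own assumptions, while $k_1 = 330$, $\lambda = 10$, $w_1 = 0.4$ and $w_2 = 0.6$ are the values reported in this paper.

import numpy as np

K1, LAM = 330.0, 10.0   # reward weights from the sensitivity analysis
W1, W2 = 0.4, 0.6       # constraint weights from Section 4.2.1

def constraint_cost(a, a_pre):
    """C(a) of Equations (5) and (6): weighted relative change of the
    water and air flows between consecutive steps; a = (flow_water, flow_air)."""
    c1 = abs(a[0] - a_pre[0]) / a[0]   # C1(a): water flow relative error
    c2 = abs(a[1] - a_pre[1]) / a[1]   # C2(a): air flow relative error
    return W1 * c1 + W2 * c2

def reward(q_fcu, cl_t, p_water, p_air, diff):
    """Reward of Equation (3): a Gaussian-shaped load-tracking term minus
    the power consumption minus the action continuity penalty."""
    load_term = K1 * np.exp(-(q_fcu - cl_t) ** 2 / (2 * (0.1 * cl_t) ** 2))
    return load_term - (p_water + p_air) - LAM * diff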

4.2. Action Constraint

In DRL tasks, adding a constraint on the action to the reward function restricts the behavior of the agent and ensures the safety of the FCU system. This section presents how to integrate action constraints into the RL framework to achieve SRL control. The approach aims to reduce the uncertainty generated by the algorithm in unknown environments, thereby enhancing the stability of the learning process.

4.2.1. Punishment Constraint Method

In continuous-action spaces, constraints can be implemented by introducing a penalty term into the reward function. Assume the original reward function is defined as $reward = r(s, a, s')$, where $s$ is the current state, $a$ is the action selected by the agent and $s'$ is the next state. To ensure the safety of actions, we reformulate the reward function as Equation (7):
$\hat{r}(s, a, s') = r(s, a, s') - \lambda C(a)$ (7)
where $\hat{r}$ is the rewritten reward function, $\lambda$ is the regularization parameter, which regulates the strength of the punishment, and $C(a)$ is calculated from $C_1(a)$ and $C_2(a)$ as in Equation (5).
Based on multiple repeated independent experiments and experience, the weight coefficients $w_1$ and $w_2$ are set to 0.4 and 0.6, and $\lambda$ is set to 10. In order to keep the relative error as small as possible and the output action of the agent (the water flow and air flow of the FCU system) continuous and stable, $\lambda C(a)$ is expressed as $r_c$, which incorporates the relative error into the reward mechanism of RL. A piecewise function is therefore designed so that the smaller the relative error, the greater the reward value, as shown in Equation (8):
$r_c = \begin{cases} 1 - C(a), & \text{if } C(a) \leq \beta \\ 0, & \text{otherwise} \end{cases}$ (8)
where $\beta$ is a pre-set relative error threshold, set to 1. When the relative error is less than $\beta$, the reward term $r_c$ increases as the relative error decreases; otherwise, the reward is 0. In practical applications, an appropriate $\beta$ can be chosen according to the specific task and scenario. Considering the severe fluctuations of FCUs controlled by traditional RL methods, the proposed method can effectively avoid control points with severe fluctuations and greatly improve the safety of the system.
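Under the definitions above, the piecewise term of Equation (8) is a one-line function; this is a minimal sketch with $\beta = 1$ as stated.

def continuity_reward(c_a, beta=1.0):
    """r_c of Equation (8): rewards small relative errors C(a) and gives no
    reward once C(a) exceeds the threshold beta."""
    return 1.0 - c_a if c_a <= beta else 0.0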

4.2.2. Action Restriction Method

In the practical application of DRL algorithms to the control processes of real-time systems such as FCUs, an action constraint function is employed to address the extreme actions that traditional RL methods take during the initial exploration phase. This function clips the actions output by the actor network to the permissible range of the FCU system. The mathematical form of this function is shown in Equation (9):
$clip(a, a_{min}, a_{max}) = \min(\max(a, a_{min}), a_{max})$ (9)
where $a$ represents the original action, and $a_{min}$ and $a_{max}$ are the lower and upper limits of the action, respectively. The clip function truncates actions that exceed the specified range, ensuring that they always remain within acceptable boundaries. In this study, according to the actual FCU system specifications and industry experience, the lower and upper limits of water flow were set to 80 and 700 kg/h, respectively, and the lower and upper limits of air flow were set to 100 and 1200 m³/h, respectively. This prevents the system from being affected by outliers or unstable training. The action restriction method keeps the water flow and air flow of the FCU system within a reasonable range, ensuring the efficient operation and long-term stability of the system.
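Equation (9) maps directly onto the following sketch, using the flow limits stated above; the example raw actions are hypothetical.

WATER_LIMITS = (80.0, 700.0)   # kg/h, pump range from the text
AIR_LIMITS = (100.0, 1200.0)   # m^3/h, fan range from the text

def clip_action(a, a_min, a_max):
    """Equation (9): truncate an action to the permissible FCU range."""
    return min(max(a, a_min), a_max)

# Hypothetical out-of-range actor outputs, pulled back inside the limits:
safe_water = clip_action(750.0, *WATER_LIMITS)   # -> 700.0 kg/h
safe_air = clip_action(60.0, *AIR_LIMITS)        # -> 100.0 m^3/h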

4.3. FCU Process Based on SRL FS-DDPG

The control flow of the proposed method is as follows.
1. At each time step $t$, the agent receives the state of the current environment (the cooling load).
2. We select $flow_{water}$ and $flow_{air}$. The currently required cooling load is taken as the input to the actor network, and a noise signal is added to obtain $flow_{water}$ and $flow_{air}$. Specifically, the action pair output by the actor network, together with a standard deviation parameter, forms a normal distribution, from which a new action pair is randomly generated to replace the one produced by the original network. Finally, the action pair is restricted to the safe range by Equation (9).
3. Model training. After the action pair from Step 2 interacts with the environment, the current reward and the next state are obtained. The experience samples $(s_t, a_t, r_t, s_{t+1})$ generated by the interaction between the actor network and the environment are stored in the experience replay pool. A batch of samples is then drawn for model training, which removes the correlation and dependence between samples and promotes the convergence of the algorithm.
4. We end the current learning step and return to Step 1. A dual neural network architecture (the actor–critic network) is used in the DDPG algorithm, and a dual target-network architecture (the target actor network and target critic network) is used for both the policy function and the value function. The state ($CL_t$) is the input of DDPG, and the output actions interact with the environment and are stored in the experience replay pool. Figure 8 shows the control method based on FS-DDPG.

4.4. FS-DDPG Algorithm Based on SRL

In DDPG, the evaluation actor network is the policy network, with the current state as input and the action as output; the action needs sufficient exploration, so noise $N$ is added. The action is shown in Equation (10):
$a = \mu(CL \mid \theta^{\mu}) + N$ (10)
where $a$ denotes the action taken by the agent, comprising $flow_{water}$ and $flow_{air}$; $\theta^{\mu}$ is the actor network parameter; $N$ is normally distributed noise; and $\mu$ stands for the evaluation actor network. When an action $a$ is applied to the environment, the reward $r$ and the next state $s'$ are obtained. The experience sample $(s, a, r, s')$ is then stored in the experience replay pool. When the samples in the experience replay pool reach the maximum capacity, the agent begins to learn and update the network parameters. At each update, $N$ samples $(s_i, a_i, r_i, s_{i+1})$, $i \in [1, N]$, are drawn from the experience pool. The value function $Q$ is fitted by the critic network. The target $y_i = r_i + \gamma Q'\left(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right)$ of each experience sample is set to update the evaluation critic network. The loss function used to update the evaluation critic network is shown in Equation (11):
$L = \dfrac{1}{N} \sum_{i} \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^2$ (11)
where $y_i$ represents the target value of sample $i$, and $Q(s_i, a_i \mid \theta^{Q})$ is the output of the evaluation critic network, which receives $s$ and $a$ as inputs and outputs the corresponding action value, with $\theta^{Q}$ the parameters of the evaluation critic network. The evaluation actor network is updated through the policy gradient, as shown in Equation (12):
$\nabla_{\theta^{\mu}} J \approx \dfrac{1}{N} \sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i, a=\mu(s_i \mid \theta^{\mu})} \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s_i}$ (12)
where $\nabla_{a} Q(s, a \mid \theta^{Q})$ is the gradient of the action value function $Q$ with respect to the action $a$, representing how the $Q$ function responds to action changes so as to maximize the expected return, and $\nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})$ is the gradient of the actor function $\mu$ with respect to its parameters $\theta^{\mu}$.
Since the network $Q(s, a \mid \theta^{Q})$ being updated is also used in calculating the target value, the $Q$ update is prone to divergence. The actor–critic is therefore modified to use "soft" target updates instead of copying weights directly. We create copies of the evaluation critic network $Q(s, a \mid \theta^{Q})$ and the evaluation actor network $\mu(s \mid \theta^{\mu})$ to calculate the target values; these target networks then slowly track the learned networks. The target networks are renewed via a soft update paradigm, as shown in Equation (13):
$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\theta^{\mu'}$ (13)
where $\theta^{Q'}$ is the parameter of the target critic network; $\theta^{Q}$ is the parameter of the evaluation critic network; $\theta^{\mu'}$ is the parameter of the target actor network; $\theta^{\mu}$ is the parameter of the evaluation actor network; and $\tau$ is the soft update coefficient. This constrains the target values to change slowly, greatly improving the stability of learning.
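For reference, the updates of Equations (11) to (13) can be written compactly in PyTorch; this is a minimal sketch assuming actor and critic modules with the usual call signatures (the network classes and optimizers are not shown), with $\gamma = 0.01$ and $\tau = 0.01$ as set in Section 5.1.

import torch

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.01, tau=0.01):
    """One update of the evaluation networks (Equations (11) and (12))
    followed by a soft update of the target networks (Equation (13))."""
    s, a, r, s_next = batch   # minibatch tensors drawn from the replay pool

    # Equation (11): regress the critic onto the target value y_i.
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = ((y - critic(s, a)) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Equation (12): ascend Q(s, mu(s)) by minimizing its negative mean.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Equation (13): target networks slowly track the evaluation networks.
    for tgt, src in ((target_critic, critic), (target_actor, actor)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)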
As shown in the following algorithm, four networks are initialized at the beginning of the algorithm deployment. The initial assignment of network parameters follows a normal distribution to speed up the training process.
The agent receives the room cooling load data and selects the water and air flow action through the evaluation actor network. The reward is calculated by Equation (3), and the cooling load of the next moment is then input. We calculate the relative error of the action output $a_t$ and progressively constrain the next action through $a_t$ in $r$. Based on the above interaction, the sample $(CL_t, flow_{air}, flow_{water}, r, CL_{t+1})$ is collected into the experience replay pool, and $N$ experience samples are drawn from the pool.
For the network update, the evaluation critic network is updated by minimizing the mean square error between the actual value function and the estimated value function in batch experience. The evaluation actor network specifies the current policy by deterministically mapping states to a specific action. The evaluation actor network is updated through policy gradients.
In order to maintain the stability of the algorithm, the target network is updated by soft updating.
The following Algorithm 1 is the FCU control method based on the FS-DDPG algorithm.
Algorithm 1. The control method of an FCU based on FS-DDPG.
Randomly initialize the critic network $Q(s, a \mid \theta^{Q})$ and the actor $\mu(s \mid \theta^{\mu})$ with weights $\theta^{Q}$ and $\theta^{\mu}$
Initialize the target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$
Initialize the experience replay pool $R$
For episode = 1, M do
  Initialize a random process $N$ for action exploration
  Receive the initial observation state $CL_0$
  For t = 1 to T do
    Select action $a_t = clip(\mu(s_t \mid \theta^{\mu}) + N_t, a_{min}, a_{max})$ according to the current policy and exploration noise $N_t$ # the selected action cannot exceed the limit values
    Execute action ($flow_{water}$, $flow_{air}$), receive reward $r_t$ and observe the new state $CL_{t+1}$
    Calculate the relative error of the action output $a_t$ and progressively constrain the next action through $a_t$ in $r$ # constrain actions via the relative error
    Store sample $(CL_t, flow_{water}, flow_{air}, r_t, CL_{t+1})$ in $R$
    Sample a random minibatch of $N$ samples $(CL_{t,i}, flow_{water,i}, flow_{air,i}, r_{t,i}, CL_{t+1,i})$ from $R$
    Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$
    Update the critic network by minimizing the loss $L = \frac{1}{N} \sum_{i} \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^2$
    Update the actor network using the sampled deterministic policy gradient $\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i, a=\mu(s_i \mid \theta^{\mu})} \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s_i}$
    Soft update the target networks: $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\theta^{Q'}$, $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\theta^{\mu'}$ # soft updating stabilizes the training process
  End for
End for
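As a complement to Algorithm 1, the following Python sketch traces one training episode; env, actor, replay_pool and update_fn are assumed interfaces standing in for the published FCU platform and the network updates of Equations (11) to (13), and the noise constants are placeholders.

import numpy as np

A_MIN = np.array([80.0, 100.0])     # (flow_water kg/h, flow_air m^3/h) lower limits
A_MAX = np.array([700.0, 1200.0])   # upper limits, as in Section 4.2.2

def run_episode(env, actor, replay_pool, update_fn, var,
                k_var=0.995, var_min=0.15, batch_size=32):
    """One episode of Algorithm 1: noisy action selection, clipping via
    Equation (9), penalized reward and minibatch network updates."""
    cl = env.reset()                    # initial cooling load CL_0
    done = False
    while not done:
        mu = actor(cl)                  # (flow_water, flow_air) from the policy
        a = np.clip(mu + np.random.normal(0.0, var, size=2), A_MIN, A_MAX)
        cl_next, r, done = env.step(a)  # reward already carries the -lambda*C(a) penalty
        replay_pool.append((cl, a, r, cl_next))
        if len(replay_pool) >= batch_size:
            idx = np.random.choice(len(replay_pool), batch_size, replace=False)
            update_fn([replay_pool[i] for i in idx])   # Equations (11) to (13)
        cl = cl_next
        var = max(var * k_var, var_min)  # exploration decay, see Section 5.1
    return var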

5. Experimental Results and Analysis

5.1. Experimental Parameter Setting

In the FS-DDPG control method based on SRL, there are four networks: the evaluation actor network, the evaluation critic network, the target actor network and the target critic network. The network structure of the evaluation actor is the same as that of the target actor, and the network structure of the evaluation critic is the same as that of the target critic. Table 3 lists the structural parameters of the evaluation actor network and the evaluation critic network.
The degree of exploration is controlled by exponentially decaying the standard deviation of the exploration noise's normal distribution, as shown in Equation (14):
$\sigma_t = var \times k_{var}$ (14)
where $\sigma_t$ is the standard deviation at time step $t$ and $var$ is the initial standard deviation; after a single execution, the standard deviation $\sigma_t$ becomes $k_{var}$ multiplied by the previous value $var$, where $k_{var}$ is less than 1. As time goes by, the standard deviation decays exponentially and the control method gradually stabilizes. To ensure that the algorithm maintains a certain learning rate in the later stages, the minimum $var_{min}$ is set to 0.15. In the CMDP, there is a greater focus on short-term benefits, so the discount factor $\gamma$ is set to 0.01. For updating the target networks, a soft update method is used, with the soft update coefficient $\tau$ also set to 0.01. The specific hyperparameters are detailed in Table 4.
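A minimal sketch of the decay schedule of Equation (14) follows; the floor $var_{min} = 0.15$ is from the text, while the initial value and decay rate are placeholders, since Table 4 is not reproduced here.

def decayed_std(var, k_var, var_min=0.15):
    """Equation (14): multiply the exploration standard deviation by
    k_var (< 1) after each step, never dropping below var_min."""
    return max(var * k_var, var_min)

# Example schedule with placeholder values var_0 = 2.0 and k_var = 0.99:
sigma, schedule = 2.0, []
for _ in range(500):
    sigma = decayed_std(sigma, 0.99)
    schedule.append(sigma)   # decays exponentially, then plateaus at 0.15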

5.2. Experimental Comparison Method

The experiment for the FCU system control method based on FS-DDPG primarily involves comparisons with RBC, MBC, DQN and control methods based on DDPG.
  • DDPG control method: DDPG is a widely used DRL algorithm in continuous action spaces. The advantage of DDPG lies in its ability to automatically adjust control strategies through training, without relying on manually designed rules. However, DDPG tends to exhibit significant action fluctuations in high-dynamic environments, which can compromise equipment safety, reduce energy efficiency and fail to meet practical application requirements. By comparing with DDPG, FS-DDPG can clearly demonstrate significant advantages in reducing flow fluctuations, improving energy efficiency and enhancing control performance. It also proves its effectiveness in addressing the safety issues present in real-world applications. Apart from the strategy constraint optimization, the network parameters in DDPG are the same as those in the FS-DDPG control method.
  • MBC control method: In practice, it is usually difficult to obtain an accurate FCU model. The MBC method here is obtained by exhaustively enumerating all control nodes of the FCU. This method obtains the optimal solution from a global perspective, but it consumes a lot of computing resources and lacks dynamic adaptability. With this method, the gap between FS-DDPG and the optimal method can be assessed. The objective function for this method is provided in Equation (15); a traversal sketch is given after this list:
    $J(flow_{air}, flow_{water}) = k \dfrac{1}{e^{\frac{(Q_{FCU} - CL)^2}{2 \times (0.1 \times CL)^2}}} - (P_{pump} + P_{fan}) - k \cdot diff$ (15)
    where $flow_{air}$ and $flow_{water}$ are the independent variables; $Q_{FCU}$, $P_{pump}$ and $P_{fan}$ can be calculated from $flow_{air}$ and $flow_{water}$; and $diff$ represents the relative error between the successive actions of the fan and pump. By maximizing the function $J(flow_{air}, flow_{water})$ through a traversal method, the optimal water flow and air flow for each state are determined.
  • RBC control method: RBC is a traditional control method that typically adjusts system control parameters based on experience and rules. The RBC method is simple and intuitive, performing well in static or relatively stable environments. However, when facing dynamic changes in the environment, RBC often requires frequent rule adjustments, which is not feasible in complex or uncertain systems. Additionally, RBC has a limited capability in optimizing system energy consumption. In contrast to RBC, FS-DDPG demonstrates its advantages in dynamic environments, particularly in significantly improving system energy efficiency and control accuracy. According to expert experience, the RBC control method is implemented through sequential decision-making, as shown in Figure 9.
  • The control logic for this sequential decision is as follows (see the sketch after this list): when $CL \in (0, 2700)$, set $flow_{air} = 300\ \mathrm{m^3/h}$ and $flow_{water} = 250\ \mathrm{kg/h}$; when $CL \in (2700, 3500)$, set $flow_{air} = 500\ \mathrm{m^3/h}$ and $flow_{water} = 350\ \mathrm{kg/h}$; when $CL \in (3500, 4100)$, set $flow_{air} = 700\ \mathrm{m^3/h}$ and $flow_{water} = 450\ \mathrm{kg/h}$; when $CL \in (4100, \infty)$, set $flow_{air} = 900\ \mathrm{m^3/h}$ and $flow_{water} = 550\ \mathrm{kg/h}$.
  • DQN control method: In the MDP modeling, the settings of state, action and reward are the same as those of DDPG. The water flow of the pump ranges from 80 to 700 kg/h with a step size of 5 kg/h, and the air flow of the fan ranges from 100 to 1200 m³/h with a step size of 5 m³/h. The policy network and target network of the agent each consist of two hidden layers of 32 neurons. The training batch size is set to 32; the discount factor $\gamma$ is set to 0.01; the update step $C_{step}$ is set to 200; the learning rate is set to 0.01; and the experience pool capacity $D$ is set to 2000. Detailed parameter settings are shown in Table 5.
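For reference, the RBC rules in Figure 9 and the MBC traversal of Equation (15) can be sketched in Python as follows; fcu_model is an assumed interface to the simulation model of Section 3.3, the 5-unit grid step is borrowed from the DQN discretization above and the diff definition and weights reuse those of Equation (3) as assumptions.

import numpy as np

def rbc_action(cl):
    """RBC rules of Figure 9: piecewise-constant (flow_air, flow_water)
    set points keyed to the cooling load CL; thresholds from the text."""
    if cl <= 2700:
        return 300.0, 250.0      # flow_air (m^3/h), flow_water (kg/h)
    elif cl <= 3500:
        return 500.0, 350.0
    elif cl <= 4100:
        return 700.0, 450.0
    return 900.0, 550.0

def mbc_action(cl, fcu_model, a_pre,
               water_grid=np.arange(80.0, 700.1, 5.0),
               air_grid=np.arange(100.0, 1200.1, 5.0)):
    """MBC traversal of Equation (15): score every feasible grid point and
    return the maximizer. fcu_model(w, f) -> (Q_FCU, P_pump, P_fan) is an
    assumed interface; a_pre = (water, air) is the previous action."""
    best_a, best_j = None, -np.inf
    for w in water_grid:
        for f in air_grid:
            q, p_pump, p_fan = fcu_model(w, f)
            diff = abs(w - a_pre[0]) / w + abs(f - a_pre[1]) / f  # illustrative diff
            j = (330.0 * np.exp(-(q - cl) ** 2 / (2 * (0.1 * cl) ** 2))
                 - (p_pump + p_fan) - 10.0 * diff)                # weights from Eq. (3)
            if j > best_j:
                best_a, best_j = (w, f), j
    return best_a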
In summary, RBC is suitable for simple scenarios without dynamic adjustment, but lacks flexibility and optimization capabilities; MBC is suitable for systems with accurate models and can provide optimal control effects, but it requires high-quality models and data. DDPG and FS-DDPG are more suitable for dynamically changing complex systems, and FS-DDPG further improves the robustness of the FCU control process. DQN performs well in discrete action spaces and is capable of handling relatively less complex systems, but it struggles with stability in highly dynamic environments due to its reliance on a value-based approach and limited exploration–exploitation balance.

5.3. Experimental Results

To evaluate the performance of the proposed algorithm, we present analyses of algorithm convergence, energy consumption, system operational performance, system safety and satisfaction. In terms of room layout, temperature and humidity, insulation and other aspects, room 1-N-6 is typical of all the rooms, so choosing room 1-N-6 for the experiments ensures the generality of the algorithm.

5.3.1. Analysis of Algorithm Convergence

Figure 10 presents the trend of cumulative rewards over a decade using the FS-DDPG and DDPG control methods, where the abscissa is the year in the training process and the ordinate is the reward value. The solid lines in Figure 10 represent the average reward over 10 independent experiments, and the shading represents the error band across the 10 experiments, with green showing the results under FS-DDPG and blue those under DDPG. In DRL, the trend of the cumulative reward indicates the convergence of the algorithm, and the reward reflects the effectiveness of the control to some extent. As can be seen from Figure 10, both DDPG and FS-DDPG initially exhibit wide error bands. As the algorithms optimize, the error bands gradually narrow. Around the fifth year, the reward value converges; by the end of the eighth year, the error bands have essentially converged, indicating that both DDPG and FS-DDPG converge. At the end of training, the reward value of FS-DDPG is slightly higher than that of DDPG, which also indicates that the control effect of FS-DDPG is better.
Furthermore, FS-DDPG is a model-free RL algorithm, and once the system converges and stabilizes, the control parameters will stabilize as well. Compared to traditional methods, FS-DDPG can reduce equipment wear and maintenance needs. If the system needs adjustments in the future, re-training the model is sufficient, without additional equipment maintenance. This enables FS-DDPG to effectively reduce maintenance costs and extend equipment lifespan during long-term operation.

5.3.2. Energy Consumption Analysis

Figure 11 shows the average energy consumption of the five control methods over 10 independent experiments, where the abscissa is the year and the ordinate is the system energy consumption. As the red curve in Figure 11 shows, as training progresses, the total energy consumption of the system under FS-DDPG control decreases substantially. In the first two years, the energy consumption of FS-DDPG and DDPG was higher than that of RBC. However, after optimization of the algorithm, the energy consumption of FS-DDPG on the test set was 1080.75 kWh; compared with DDPG, RBC and DQN, the energy saving rate reached 11.90%, 51.76% and 37.61%, respectively. Compared with DQN, FS-DDPG demonstrates more significant energy savings: while DQN performs well in systems with discrete action spaces, it struggles in highly dynamic environments, leading to less efficient control and higher energy consumption than FS-DDPG. The results therefore prove that FS-DDPG has a positive effect on the energy saving control of FCUs. With respect to total energy consumption, FS-DDPG and DDPG perform similarly to MBC (the optimal control method). However, from the 1st year to the 10th year, FS-DDPG consistently outperforms DDPG; the reason is that, at all fluctuating action points, FS-DDPG maintains better consistency in the water and air flow of the system, yielding lower energy consumption in practical applications.
In addition, we conducted the experiment using a separate test set (NOAA 2022 weather data) [27], representing a real test scenario; the results can be seen in Year 11 (test set). It can be seen from Figure 11 that the experimental results on the test set are very close to the energy consumption of the 9th and 10th years, which indicates that the performance improvement achieved during training is not limited to the training set but also applies to new data. The test set experiments help address potential problems of overfitting to the simulated environment.
Table 6 shows the confidence intervals for the energy consumption (kWh) of the five methods in each year. The confidence interval (CI) is calculated as shown in Equation (16):
$CI = \mu \pm Z \dfrac{SD}{\sqrt{n}}$ (16)
where $\mu$ is the mean, $SD$ is the standard deviation, $n$ is the amount of data per year and $Z$ is the Z-value of the 95% confidence interval, usually taken as 1.96.
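Equation (16) can be computed directly from each year's per-experiment energy consumption values; a minimal sketch with illustrative sample data follows.

import numpy as np

def confidence_interval(samples, z=1.96):
    """95% CI of Equation (16): mean +/- z * SD / sqrt(n)."""
    x = np.asarray(samples, dtype=float)
    half = z * x.std(ddof=1) / np.sqrt(len(x))
    return x.mean() - half, x.mean() + half

# Illustrative use on hypothetical annual energy consumption values (kWh):
lo, hi = confidence_interval([982.0, 1010.5, 955.3, 1001.8, 990.2])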
As training progresses, the energy consumption confidence interval of FS-DDPG narrows overall, demonstrating a significant improvement in the stability of its performance. The confidence interval of FS-DDPG is consistently smaller than that of DQN, DDPG and RBC, with a notable example in the tenth year: the intervals for FS-DDPG, DDPG, RBC and DQN are (925.97, 1034.22), (1104.88, 1315.05), (2281.93, 2668.55) and (1642.64, 1852.81), respectively. This indicates that FS-DDPG performs more effectively in reducing energy consumption.

5.3.3. System Operation Performance Analysis

With the progress of training, FS-DDPG and DDPG approach the optimal control method. We describe this process in terms of the relative error $G$, calculated as shown in Equation (17):
$G = \dfrac{\left| a - a' \right|}{a'} \times 100\%$ (17)
where $a$ is the output water flow under FS-DDPG or DDPG, and $a'$ is the output water flow under MBC.
Figure 12 shows the annual average relative error distribution of FS-DDPG, DDPG and MBC in water flow control, with the vertical axis representing the relative error (%). Comparing selected years, the distribution of annual average relative errors in water flow varies significantly. The blue and orange boxes represent the bulk of the normal relative error distribution, indicating that FS-DDPG, DDPG and MBC have similar effects in controlling water flow. The gray dots above the upper edge (horizontal line) represent outliers, indicating clear differences among DDPG, FS-DDPG and MBC in control effect. During the learning process, the values of $G$ become smaller and more concentrated, indicating that the control effect of these two algorithms generally moves closer to MBC. The outliers in $G$ between FS-DDPG and MBC decrease, indicating that the actions selected by FS-DDPG approach those of MBC. By contrast, although $G$ between DDPG and MBC becomes smaller and more concentrated, many outliers remain at the end of training. This indicates that a considerable gap remains between DDPG and MBC in water flow control, mainly caused by the lack of a constraint on fluctuating actions in DDPG.
For air flow, $a$ in Equation (17) is the output air flow under FS-DDPG or DDPG, and $a'$ is the output air flow under MBC.
Figure 13 shows the annual average relative error distribution of FS-DDPG, DDPG and MBC in air flow control, with the vertical axis representing the relative error (%). Comparing selected years, the distribution of annual average relative errors in air flow varies significantly. The blue and orange boxes represent the bulk of the normal relative error distribution, indicating that FS-DDPG, DDPG and MBC have similar effects in controlling air flow. The gray dots above the upper edge (horizontal line) represent outliers, indicating clear differences among DDPG, FS-DDPG and MBC in control effect. During the learning process, the values of $G$ become smaller and more concentrated, indicating that the control effect of these two algorithms generally moves closer to MBC. The outliers in $G$ between FS-DDPG and MBC decrease, indicating that the actions selected by FS-DDPG approach those of MBC. By contrast, although $G$ between DDPG and MBC becomes smaller and more concentrated, many outliers remain at the end of training. This indicates that a considerable gap remains between DDPG and MBC in air flow control, mainly caused by the lack of a constraint on fluctuating actions in DDPG.
Additionally, when the relative error of the water or air flow ($G$) is less than 0.5, the action is considered to satisfy action consistency (AC). AC is calculated as follows:

$$AC = \frac{Amount_{consistent\ action}}{Amount_{total\ action}} \times 100\%$$

where $Amount_{consistent\ action}$ is the number of water or air flow set points in a year that satisfy the action consistency condition, and $Amount_{total\ action}$ is the total number of water or air flow set points in that year.
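For illustration, the two metrics can be computed from logged set points as in the following sketch (the arrays are hypothetical; the paper's own tooling is not published in this form):

```python
import numpy as np

def relative_error(a, a_mbc):
    """Relative error G (%) between a controller's set points and MBC's (Equation (17))."""
    a, a_mbc = np.asarray(a, dtype=float), np.asarray(a_mbc, dtype=float)
    return np.abs(a - a_mbc) / a_mbc * 100.0

def action_consistency(a, a_mbc, threshold=0.5):
    """Share (%) of set points whose relative error G is below the AC threshold."""
    return np.mean(relative_error(a, a_mbc) < threshold) * 100.0

# Hypothetical one-year logs of water flow set points (kg/s).
a_fs_ddpg = np.array([0.52, 0.48, 0.50, 0.55])
a_mbc = np.array([0.51, 0.49, 0.50, 0.54])
print(action_consistency(a_fs_ddpg, a_mbc))
```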
Table 7 shows the annual water and air flow action consistency analysis. As shown in Table 7, FS-DDPG shows a year-on-year improvement in action consistency for both water flow and air flow control. In terms of water flow, FS-DDPG reaches 89.96% in the tenth year, a significant improvement compared to DDPG’s 83.74%. For air flow, FS-DDPG also outperforms DDPG, with an action consistency of 89.23% in the tenth year, while DDPG stands at 86.59%. This indicates that FS-DDPG performs better than DDPG in terms of control accuracy and consistency for both water and air flows, showing results closer to the control performance of MBC.

5.3.4. System Safety Analysis

By comparing the annual average water flow relative errors across different years, we analyze the stability of the five control methods. The relative error is calculated with Equation (17), taking $a = a_t^{water}$ and $a_{MBC} = a_{t-1}^{water}$, where $a_t^{water}$ is the water flow of the system at moment $t$ and $a_{t-1}^{water}$ is the water flow at moment $t-1$. Figure 14 presents the annual average relative error distribution in water flow for the five control methods (FS-DDPG, DDPG, MBC, DQN and RBC) over selected years, showing significant changes in the relative error of average water flow. The abscissa lists the five methods, and the vertical axis is the annual average relative error of the water flow (%). In the first year, most values for MBC and RBC concentrate near zero, because MBC is the optimal control method and RBC is a rule-based control method. For DDPG, DQN and FS-DDPG, the error values are widely scattered (between 0 and 650%), and some actions carry considerable errors, because the agent selects actions that differ greatly from one step to the next. As training progresses, the outliers (large error values) for FS-DDPG gradually decrease and its error values converge towards zero, which indicates that FS-DDPG's constraints on action fluctuations are increasingly effective. The error distributions for DDPG and DQN, however, remain dispersed, with some actions exhibiting considerable variability (e.g., DDPG's errors range from 0 to 620% in the 10th year). At the end of training, the FS-DDPG distribution approaches those of MBC and RBC (its error values concentrate between 0 and 1.8), while DDPG and DQN show no significant change. In the test set, FS-DDPG reduced the dramatic water flow fluctuations ($G > 1$) by 98.20% compared to DDPG. This reflects the absence of constraints on fluctuating actions in DDPG and demonstrates that constraining actions helps the algorithm converge to a smooth strategy.
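The fluctuation events counted in Table 8 can be reproduced from a flow log with a sketch like the following (hypothetical data; $G$ here is the step-to-step relative error of Equation (17)):

```python
import numpy as np

def count_fluctuations(flow, threshold=1.0):
    """Count step-to-step relative errors G = |a_t - a_{t-1}| / a_{t-1} above threshold."""
    flow = np.asarray(flow, dtype=float)
    g = np.abs(np.diff(flow)) / flow[:-1]
    return int(np.sum(g > threshold))

# Hypothetical hourly water flow log (kg/s).
log = [0.50, 0.52, 0.10, 0.50, 0.48]
print(count_fluctuations(log))  # -> 1 (the jump 0.10 -> 0.50 gives G = 4 > 1)
```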
In addition, to ensure the robustness of the results, we conducted the experiment on a separate test set (NOAA 2022 weather data) [27] representing a real test scenario; the result is shown as Year 11 (test set). Comparing the subgraphs, the distribution of water flow errors in the test set closely aligns with those of the 8th and 10th years, which indicates that the performance improvements achieved during training are not limited to the training set but also carry over to new data. By incorporating Year 11 as a test set, we address potential concerns of overfitting to the simulated environment and strengthen the applicability of the proposed control method to real-world scenarios.
By comparing the annual average air flow relative errors across different years, the stability of the five control methods can be analyzed in the same way, taking $a = a_t^{air}$ and $a_{MBC} = a_{t-1}^{air}$, where $a_t^{air}$ is the air flow of the system at moment $t$ and $a_{t-1}^{air}$ is the air flow at moment $t-1$. Figure 15 presents the annual average relative error distribution in air flow for the five control methods (FS-DDPG, DDPG, MBC, DQN and RBC) over selected years, showing significant changes in the relative error of average air flow. The abscissa lists the five methods, and the vertical axis is the annual average relative error of the air flow (%). In the first year, most values for MBC and RBC concentrate near zero, because MBC is the optimal control method and RBC is a rule-based control method. For DDPG, DQN and FS-DDPG, the error values are widely scattered (between 0 and 650%), and some actions carry considerable errors, because the agent selects actions that differ greatly from one step to the next. As training progresses, the outliers (large error values) for FS-DDPG gradually decrease and its error values converge towards zero, which indicates that FS-DDPG's constraints on action fluctuations are increasingly effective. The error distributions for DDPG and DQN, however, remain dispersed, with some actions exhibiting considerable variability (e.g., DDPG's errors range from 0 to 630% in the 10th year). At the end of training, the FS-DDPG distribution approaches those of MBC and RBC (its error values concentrate between 0 and 1.9), while DDPG and DQN show no significant change. In the test set, FS-DDPG reduced the dramatic air flow fluctuations ($G > 1$) by 95.82% compared to DDPG. This reflects the absence of constraints on fluctuating actions in DDPG and demonstrates that constraining actions helps the algorithm converge to a smooth strategy.
In addition, to ensure the robustness of the results, we conducted the experiment on the same separate test set (NOAA 2022 weather data) [27]; the result is shown as Year 11 (test set). Comparing the subgraphs, the distribution of air flow errors in the test set closely aligns with those of the 7th and 10th years, which indicates that the performance improvements achieved during training are not limited to the training set but also carry over to new data. By incorporating Year 11 as a test set, we address potential concerns of overfitting to the simulated environment and strengthen the applicability of the proposed control method to real-world scenarios.
Table 8 compares the DDPG and FS-DDPG control methods in terms of water flow and air flow fluctuations. The proportion by which FS-DDPG reduces fluctuating flow relative to DDPG reflects its stability advantage in the control process, particularly in the later years. In the test set, FS-DDPG reduced water flow fluctuations by 98.20% and air flow fluctuations by 95.82% compared to DDPG. These results indicate that FS-DDPG controls the system more smoothly and effectively, curbing extreme fluctuations.

5.3.5. Satisfaction Analysis

In the FCU system, it is crucial to maintain consistency between the real cooling capacity and the required cooling load. We use the cooling load deviation to represent the relationship between real cooling capacity and required cooling load, as shown in Equation (19):
$$\Delta CL = \left| CL_t^{FCU} - CL_t \right| \quad (19)$$
where $\Delta CL$ represents the cooling load deviation at moment $t$, $CL_t^{FCU}$ is the cooling capacity delivered at moment $t$ and $CL_t$ is the expected cooling load at moment $t$. We use the daily cooling load deviation to evaluate the FCU system's ability to meet demand during the cooling season over ten years. Figure 16 shows the cooling load deviation in six selected years (years with evident cooling load deviations), where the abscissa is the number of days in the cooling season and the ordinate is the time of day (only considering human activity time, 7:00–21:00). Lighter-colored areas indicate smaller cooling load deviations, while darker-colored areas indicate larger ones. In the first year, there are clearly many dark squares with large $\Delta CL$ values, representing large deviations between actual cooling capacities and expected cooling loads. During the learning process, the number of dark squares gradually decreases, which indicates both that the periods with large cooling load deviation shrink and that FS-DDPG is continually optimized, finally showing good performance. However, some dark squares remain across all ten years; their presence indicates that the system cannot fully meet the cooling load requirements at certain moments, which may be related to external climatic conditions, internal load fluctuations or system maintenance.
Table 9 shows the trend of the satisfaction rate (SR) for each method. The satisfaction rate is calculated in Equation (20):
$$SR = \frac{Satisfied_{hours}}{Total_{hours}} \times 100\% \quad (20)$$

where $Total_{hours}$ is the total number of hours in a year and $Satisfied_{hours}$ is the number of hours in which the cooling load demand is met. When $\Delta CL$ is less than 10% of $CL_t$, the current actual cooling capacity demand is considered met. As Table 9 shows, the SR of FS-DDPG rises from 17.13% in the first year to 84.33% on the test set, approaching MBC (93.46%) and clearly exceeding RBC (59.77%) and DQN (69.77%).
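A minimal sketch of this bookkeeping (Equations (19) and (20)), with hypothetical hourly logs:

```python
import numpy as np

def satisfaction_rate(cl_fcu, cl_expected, tol=0.10):
    """Share (%) of hours where |CL_t^FCU - CL_t| < tol * CL_t."""
    cl_fcu = np.asarray(cl_fcu, dtype=float)
    cl_expected = np.asarray(cl_expected, dtype=float)
    deviation = np.abs(cl_fcu - cl_expected)
    return np.mean(deviation < tol * cl_expected) * 100.0

# Hypothetical hourly cooling capacities vs. expected loads (kW).
print(satisfaction_rate([3.1, 2.8, 4.5], [3.0, 3.0, 3.2]))  # -> 66.67 (third hour misses)
```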

5.3.6. Sensitivity Analysis

To further evaluate the stability of the control strategy and the system performance, we perform a sensitivity analysis of several key parameters on the test set, focusing on the penalty term weight and the constraint tightening threshold. We adjusted the values of these parameters and analyzed their impact on system performance.
Table 10 compares the performance of FS-DDPG under different penalty term weights and constraint tightening thresholds. Among all the combinations tested in Table 10, FS-DDPG performs best when the penalty term weight is 10.0 and the constraint tightening threshold is 1.0. In this case, the fluctuation reduction ratios for water flow and air flow are 98.20% and 95.82%, respectively, with an energy consumption of 1080.75 kWh and a satisfaction rate (SR) of 84.33%. In contrast, the other combinations generally show higher energy consumption and lower satisfaction rates. Specifically, when the penalty term weight increases ($\lambda = 15$) or the constraint tightening threshold decreases ($\beta = 0.5$), the fluctuation reduction ratio changes little, but both energy consumption and the satisfaction rate worsen. This is because the agent tends to choose more conservative actions, sacrificing energy efficiency and satisfaction to ensure smoother actions. Overall, the combination of a penalty term weight of 10.0 and a constraint tightening threshold of 1.0 achieves the best balance among satisfaction rate, energy saving and safety performance.
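The exact form of the penalty term and the constraint tightening rule is the one defined in the methodology section; for intuition only, the sketch below assumes a hinge-shaped penalty on the step-to-step action change and a clipped action range. The names and functional forms here are illustrative, not the paper's implementation:

```python
import numpy as np

LAMBDA = 10.0  # penalty term weight (best setting in Table 10)
BETA = 1.0     # constraint tightening threshold (best setting in Table 10)

def shaped_reward(reward, action, prev_action, lam=LAMBDA, beta=BETA):
    """Penalize only the part of the action change that exceeds the threshold beta."""
    violation = max(0.0, abs(action - prev_action) - beta)
    return reward - lam * violation

def tighten_action(action, prev_action, beta=BETA):
    """Constraint tightening: keep the new set point within beta of the previous one."""
    return float(np.clip(action, prev_action - beta, prev_action + beta))
```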

6. Conclusions and Future Work

When traditional RL methods are applied to control the FCU, significant flow fluctuations may occur, leading to system overload, increased energy consumption and even equipment damage. To address this challenge, we model the FCU control process as a constrained Markov decision process, limiting the agent's action space by incorporating a penalty term for process constraints into the reward function. On this basis, we propose FS-DDPG, an SRL-based algorithm, to optimize the control strategy for the FCU. To validate its performance, we used DeST software to construct a model of office rooms based on real historical weather data and to calculate the corresponding cooling load demand, and we established a heat and mass transfer model of the FCU whose accuracy and feasibility were validated. The experimental results show the following:
  • Compared with DDPG, FS-DDPG suppresses 98.20% of pump flow fluctuations and 95.82% of fan airflow fluctuations. Additionally, compared with DDPG and RBC, FS-DDPG achieves energy savings of 11.9% and 51.76%, respectively.
  • The proposed method shows performance and satisfaction levels very close to those of MBC, indicating that FS-DDPG is highly adaptable to dynamic environments, able to meet the indoor cooling load requirements, reduce system energy consumption and ensure long-term stable operation of the system.
In the early stages of the learning process, due to the exploration mechanism of the algorithm and dynamic changes in the environment, there are still some large fluctuations in water and air flow. The Lyapunov method may offer a solution, but balancing stable system operation with efficiency remains challenging. This will be a focus for future research. Furthermore, in future studies, we plan to further extend the FS-DDPG method to explore its application in more complex and variable FCU systems, especially for multi-region FCU systems. At the same time, we will also consider the performance of the system in different environments, especially how to ensure the balance between system stability and energy efficiency under abnormal conditions or uncertain factors.

Author Contributions

Conceptualization, C.L.; data curation, C.L., Q.F. and Y.L.; formal analysis, C.L. and Q.F.; funding acquisition, Q.F., J.C. and Y.W.; investigation, J.C., Y.W., Y.L. and C.L.; methodology, C.L. and Q.F.; project administration, J.C.; software, C.L. and Q.F.; supervision, Q.F., J.C., Y.W., Y.L. and H.W.; validation, C.L. and Q.F.; writing—original draft, C.L.; writing—review and editing, C.L., Q.F., Y.W., Y.L. and H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financially supported by the National Key R&D Program of China (No. 2020YFC2006602), the Foundation of Engineering Research Center of Construction Carbon Neutral Technology of Jiangsu Province (No. JZTZH2023-0402), the National Natural Science Foundation of China (No. 62372318, No. 62102278, No. 62072324), the University Natural Science Foundation of Jiangsu Province (No. 21KJA520005) and the Science and Technology Development Project of Suzhou under grant SGC2021078.

Data Availability Statement

The experiment results are available at https://github.com/leecy123123/fcu_1.git (accessed on 10 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Geometric structural parameters of the FCU.
Parameter Name | Value
Fin thickness δ_f | 0.2 mm
Tube wall thickness δ_t | 1 mm
Pipe outside diameter d_o | 10 mm
Pipe spacing perpendicular to the airflow s_1 | 25 mm
Spacing of pipes parallel to the airflow s_2 | 20 mm
Number of rows of tubes parallel to the air flow n_p | 2
Number of rows of tubes perpendicular to the air flow n_v | 8
Number of waterways n_w | 2
Spacing of fins s_f | 2.2 mm
Effective length of coil heat transfer L_e | 1 m

Appendix B

Most FCU systems operate in steady or metastable states. For validating the control algorithm, the following assumptions are made: the outer side of the tube wall absorbs or releases heat uniformly; the thermal state of the air and the outer fins is treated as a lumped parameter over the whole space; radiation heat transfer of the heat exchanger is ignored; and axial heat conduction in the tube wall and water is not considered. The input parameters of the fan coil system model are the inlet air dry-bulb temperature $t_1$, inlet wet-bulb temperature $t_{wet1}$, inlet water temperature $t_{w1}$, water flow $flow_{water}$ and air flow $flow_{air}$.
The specific calculation details are as follows:
1.
Calculate $Q_1$ and $h_2$.
First, assume the outlet water temperature $t_{w2}$, then calculate the cooling capacity $Q_1$ according to Equation (A1) and compute the outlet air enthalpy $h_2$ using the air-side energy conservation Equation (A2).

$$Q_1 = flow_{water} \times C_{wm} \times \left( t_{w2} - t_{w1} \right) \quad (A1)$$

$$h_2 = h_1 - \frac{Q_1}{\rho_1 \times flow_{air}} \quad (A2)$$

where $flow_{water}$ (kg/s) is the water flow rate, $C_{wm}$ (J/(kg·K)) is the average specific heat capacity of water, $t_{w1}$ (°C) is the inlet water temperature, $\rho_1$ (kg/m³) is the inlet air density, $flow_{air}$ (m³/s) is the circulating air volume and $h_1$ (J/kg) is the inlet air enthalpy.
2.
Calculate $t_{wall\text{-}in}$ and $t_{wall\text{-}out}$ to determine dry or wet conditions.
According to Equation (A3), calculate the inner wall temperature of the copper tube, $t_{wall\text{-}in}$ (°C), and then calculate the outer wall temperature, $t_{wall\text{-}out}$ (°C), using Equation (A4). Compare the outer wall temperature with the dew point temperature of the air entering the FCU, $t_{d1}$ (°C). If $t_{wall\text{-}out} < t_{d1}$, the FCU system is operating under wet conditions; otherwise, it is operating under dry conditions.

$$t_{wall\text{-}in} = \frac{Q_1}{\alpha_{water} \times L_t \times f_i} + \frac{t_{w1} + t_{w2}}{2} \quad (A3)$$

$$t_{wall\text{-}out} = \frac{Q_1 \times \ln \left( d_o / d_i \right)}{2 \pi \lambda_t L_t} + t_{wall\text{-}in} \quad (A4)$$

Here, $\alpha_{water}$ (W/(m²·°C)) is the water-side heat transfer coefficient, $f_i$ (m²/m) is the heat transfer surface area per unit length of the inner surface of the tube, $L_t$ (m) is the total length of the heat exchange tube, $\lambda_t$ (W/(m·°C)) is the thermal conductivity of the copper tube and $d_i$ and $d_o$ are the inner and outer diameters of the heat exchange tube, respectively.
3.
Calculate the outlet air parameters.
Since dry conditions involve only heat transfer and no mass transfer, the moisture content of the air $d_2$ remains unchanged, as shown in Equation (A5). When the process involves mass transfer, $d_2$ is calculated using Equation (A6). Then, $d_2$ is substituted into Equation (A7) to calculate $t_2$. The conversions between the thermophysical properties of moist air (moisture content, dry-bulb temperature, wet-bulb temperature, humidity, dew point temperature, specific enthalpy, etc.) are implemented by directly calling CoolProp in Python.

$$d_2 = d_1 \quad (A5)$$

$$d_2 = d_1 - \left( d_1 - d_w \right) \frac{h_1 - h_2}{h_1 - h_w} \quad (A6)$$

$$t_2 = \frac{h_2 - I \, d_2}{C_{pa} + C_{pv} d_2} \quad (A7)$$

Among them, $d_1$ (g/kg) is the moisture content of the inlet air; $h_w$ (J/kg) is the saturation enthalpy corresponding to the machine dew point temperature; $d_w$ (g/kg) is the saturation moisture content corresponding to the machine dew point temperature; $t_2$ (°C) is the dry-bulb temperature of the outlet air; $C_{pa}$ is the specific heat capacity at constant pressure of dry air, taken as 1.006 kJ/(kg·°C); $C_{pv}$ is the specific heat capacity at constant pressure of water vapor, taken as 1.86 kJ/(kg·°C); and $I$ is the latent heat of vaporization turning water at 0 °C into water vapor at 0 °C, taken as 2501 kJ/kg.
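For reference, a minimal sketch of such moist-air property conversions through CoolProp's humid-air interface `HAPropsSI` (the specific calls and state points are illustrative assumptions, not the paper's exact usage):

```python
from CoolProp.CoolProp import HAPropsSI

P = 101325.0  # standard atmospheric pressure (Pa)

# Humidity ratio (kg/kg dry air) from dry-bulb temperature and specific enthalpy.
w = HAPropsSI('W', 'T', 273.15 + 25.0, 'P', P, 'H', 50000.0)

# Wet-bulb and dew point temperatures (K) from dry-bulb temperature and humidity ratio.
t_wet = HAPropsSI('B', 'T', 273.15 + 25.0, 'P', P, 'W', w)
t_dew = HAPropsSI('D', 'T', 273.15 + 25.0, 'P', P, 'W', w)
print(w, t_wet - 273.15, t_dew - 273.15)
```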
4.
Calculate $Q_2$ and check the iterative hypothesis.
Calculate the cooling capacity $Q_2$ according to Equations (A8) and (A9) (Equation (A8) applies to dry conditions and Equation (A9) to wet conditions). Check whether the inequality $2 \left| Q_1 - Q_2 \right| / \left( Q_1 + Q_2 \right) < \varepsilon$ holds. If it does, the assumption is correct and the iteration ends; otherwise, re-assume $t_{w2}$ and return to Step 1 to recalculate until the condition is met.

$$Q_2 = \eta_s \alpha_{air} L_t f_t \left( \frac{t_1 + t_2}{2} - t_{wall\text{-}out} \right) \quad (A8)$$

$$Q_2 = \xi \eta_s \alpha_{air} L_t f_t \left( \frac{t_1 + t_2}{2} - t_{wall\text{-}out} \right) \quad (A9)$$

where $\xi$ is the dehumidification factor; $f_t$ (m²/m) is the total external heat exchange surface area per unit length of tube in the fan coil; $\eta_s$ is the efficiency of the aluminum fins on the surface of the fan coil; and $\alpha_{air}$ (W/(m²·°C)) is the overall air-side heat transfer coefficient.
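Putting Steps 1–4 together, a minimal sketch of the iteration on the assumed outlet water temperature follows. The bisection update and the helper callable `q2_model`, which stands in for Equations (A2)–(A9), are assumptions for illustration:

```python
def solve_outlet_water_temp(t_w1, flow_water, q2_model, c_wm=4186.0,
                            t_lo=None, t_hi=None, eps=1e-3, max_iter=100):
    """Iterate on the assumed outlet water temperature t_w2 until Q1 and Q2 agree.

    q2_model(t_w2) must return the air-side capacity Q2 via Equations (A2)-(A9);
    it is a hypothetical callable standing in for the full coil model.
    c_wm is a typical specific heat of water in J/(kg*K).
    """
    t_lo = t_w1 + 0.01 if t_lo is None else t_lo  # water leaves warmer than it enters
    t_hi = t_w1 + 15.0 if t_hi is None else t_hi  # assumed upper bound on the rise
    for _ in range(max_iter):
        t_w2 = 0.5 * (t_lo + t_hi)
        q1 = flow_water * c_wm * (t_w2 - t_w1)    # Equation (A1), water-side capacity
        q2 = q2_model(t_w2)                       # air-side capacity
        if 2.0 * abs(q1 - q2) / (q1 + q2) < eps:  # convergence test from Step 4
            return t_w2, q1
        if q1 > q2:
            t_hi = t_w2                           # assumed temperature rise too large
        else:
            t_lo = t_w2
    raise RuntimeError("t_w2 iteration did not converge")
```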
5.
System power consumption.
The power consumption of the fan coil system consists of the power consumption of the water pump and that of the fan, which depend on the water flow and the air flow in the system, respectively. In Figure A1, the original air flow and water flow values are represented by dots, and their corresponding fitted values by curves.
Figure A1. Water/air flow power diagram.
The fitted curves in Figure A1 are obtained by second-order (quadratic) polynomial regression with Equation (A10):

$$P = coeff_2 \cdot flow^2 + coeff_1 \cdot flow + coeff_0 \quad (A10)$$

where $flow$ represents the volume flow of air (m³/h) or the mass flow of water (kg/h), and $P$ (W) represents power. Table A2 lists the fitting coefficients of the power consumption related to air flow and water flow obtained by data regression according to Equation (A10).
Table A2. Quadratic fitting coefficients of power consumption and flow.
Air Coefficient | Value | Water Coefficient | Value
coeff_2 | 3.22086094 × 10⁻⁵ | coeff_2 | 3.68380722 × 10⁻⁴
coeff_1 | 3.05481795 × 10⁻² | coeff_1 | −2.43551880 × 10⁻²
coeff_0 | 3.56900355 × 10¹ | coeff_0 | 2.48530002 × 10¹
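A minimal sketch evaluating Equation (A10) with the Table A2 coefficients (the operating points in the example are arbitrary):

```python
def power_w(flow, c2, c1, c0):
    """Equation (A10): P = c2*flow^2 + c1*flow + c0, with P in watts."""
    return c2 * flow**2 + c1 * flow + c0

# Coefficients from Table A2.
fan_power = lambda air_flow: power_w(air_flow, 3.22086094e-5, 3.05481795e-2, 3.56900355e1)
pump_power = lambda water_flow: power_w(water_flow, 3.68380722e-4, -2.43551880e-2, 2.48530002e1)

print(fan_power(500.0), pump_power(300.0))  # example operating point (m^3/h, kg/h)
```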

Appendix C

Table A3. Average relative error of the calculation results.
Generated Quantity | Relative Error (Dry Condition) | Relative Error (Wet Condition)
Cooling capacity | 6.9045 | 13.8871
Outlet air dry-bulb temperature | 4.1135 | 6.6515
Outlet air wet-bulb temperature | 4.0281 | 5.6513
Outlet water temperature | 0.4210 | 6.1226
The accuracy of the simulation model is verified from the perspectives of heat transfer, the air side and the water side. Heat transfer verification is reflected in the error of the cooling capacity; air-side verification in the errors of the outlet air dry-bulb and wet-bulb temperatures (under standard atmospheric pressure); and water-side verification in the error of the outlet water temperature. The FCU system shows larger errors under wet working conditions than under dry conditions (mainly in the cooling capacity and outlet water temperature). The primary reason is that idealized assumptions are used in the heat and mass transfer calculations under wet conditions, treating the machine dew point temperature as the external wall temperature of the FCU. Overall, the simulation results of the FCU model align with the experimental outcomes, and the model can be used to validate control algorithms.

References

  1. Ding, Z.K.; Fu, Q.M.; Chen, J.P.; Wu, H.J.; Lu, Y.; Hu, F.Y. Energy-efficient control of thermal comfort in multi-zone residential HVAC via reinforcement learning. Conn. Sci. 2022, 34, 2364–2394.
  2. World Energy Outlook 2023. Available online: www.iea.org/terms (accessed on 8 January 2025).
  3. Fang, Z.; Tang, T.; Su, Q.; Zheng, Z.; Xu, X.; Ding, Y.; Liao, M. Investigation into optimal control of terminal unit of air conditioning system for reducing energy consumption. Appl. Therm. Eng. 2020, 177, 115499.
  4. Dezfouli, M.M.S.; Dehghani-Sanij, A.R.; Kadir, K.; Suhairi, R.; Rostami, S.; Sopian, K. Is a fan coil unit (FCU) an efficient cooling system for net-zero energy buildings (NZEBs) in tropical regions? An experimental study on thermal comfort and energy performance of an FCU. Results Eng. 2023, 20, 101524.
  5. Kou, X.; Du, Y.; Li, F.; Pulgar-Painemal, H.; Zandi, H.; Dong, J.; Olama, M.M. Model-based and data-driven HVAC control strategies for residential demand response. IEEE Open Access J. Power Energy 2021, 8, 186–197.
  6. In Proceedings of the 2019 American Control Conference (ACC), Philadelphia, PA, USA, 10–12 July 2019. Available online: https://ieeexplore.ieee.org/xpl/conhome/8789884/proceeding (accessed on 10 January 2025).
  7. Fu, Q.; Han, Z.; Chen, J.; Lu, Y.; Wu, H.; Wang, Y. Applications of reinforcement learning for building energy efficiency control: A review. J. Build. Eng. 2022, 50, 104165.
  8. Lee, D.; Ooka, R.; Ikeda, S.; Choi, W.; Kwak, Y. Model predictive control of building energy systems with thermal energy storage in response to occupancy variations and time-variant electricity prices. Energy Build. 2020, 225, 110291.
  9. Yang, S.; Wan, M.P. Machine-learning-based model predictive control with instantaneous linearization—A case study on an air-conditioning and mechanical ventilation system. Appl. Energy 2022, 306, 118041.
  10. Ding, Z.; Huang, Y.; Yuan, H.; Dong, H. Introduction to reinforcement learning. In Deep Reinforcement Learning: Fundamentals, Research and Applications; Springer: Berlin/Heidelberg, Germany, 2020; pp. 47–123.
  11. Zhou, S.L.; Shah, A.A.; Leung, P.K.; Zhu, X.; Liao, Q. A comprehensive review of the applications of machine learning for HVAC. DeCarbon 2023, 2, 100023.
  12. Esrafilian-Najafabadi, M.; Haghighat, F. Occupancy-based HVAC control systems in buildings: A state-of-the-art review. Build. Environ. 2021, 197, 107810.
  13. Zhang, Y.; Chen, X.; Fu, Q.; Chen, J.; Wang, Y.; Lu, Y.; Liu, L. Priori knowledge-based deep reinforcement learning control for fan coil unit system. J. Build. Eng. 2023, 82, 108157.
  14. Qiu, S.; Li, Z.; Li, Z.; Li, J.; Long, S.; Li, X. Model-free control method based on reinforcement learning for building cooling water systems: Validation by measured data-based simulation. Energy Build. 2020, 218, 110055.
  15. Gao, C.; Wang, D. Comparative study of model-based and model-free reinforcement learning control performance in HVAC systems. J. Build. Eng. 2023, 74, 106852.
  16. Han, Z.; Fu, Q.; Chen, J.; Wang, Y.; Lu, Y.; Wu, H.; Gui, H. Deep Forest-Based DQN for Cooling Water System Energy Saving Control in HVAC. Buildings 2022, 12, 1787.
  17. Li, S.; Wei, M.; Wei, Y.; Wu, Z.; Han, X.; Yang, R. A fractional order PID controller using MACOA for indoor temperature in air-conditioning room. J. Build. Eng. 2021, 44, 103295.
  18. Li, S.; Wang, D.; Han, X.; Cheng, K.; Zhao, C. Auto-tuning parameters of fractional PID controller design for air-conditioning fan coil unit. J. Shanghai Jiaotong Univ. (Sci.) 2021, 26, 186–192.
  19. Verhelst, J.; Van Ham, G.; Saelens, D.; Helsen, L. Model selection for continuous commissioning of HVAC-systems in office buildings: A review. Renew. Sustain. Energy Rev. 2017, 76, 673–686.
  20. Zhao, A.; Wei, Y.; Quan, W.; Xi, J.; Dong, F. Distributed model predictive control of fan coil system. J. Build. Eng. 2024, 94, 110028.
  21. Sanama, C.; Xia, X.; Nguepnang, M. PID-MPC Implementation on a Chiller-Fan Coil Unit. J. Math. 2022, 2022, 8405361.
  22. Martinčević, A.; Vašak, M.; Lešić, V. Identification of a control-oriented energy model for a system of fan coil units. Control Eng. Pract. 2019, 91, 104100.
  23. Guillen, D.P.; Anderson, N.; Krome, C.; Boza, R.; Griffel, L.M.; Zouabe, J.; Al Rashdan, A.Y. A RELAP5-3D/LSTM model for the analysis of drywell cooling fan failure. Prog. Nucl. Energy 2020, 130, 103540.
  24. Lin, C.M.; Liu, H.Y.; Tseng, K.Y.; Lin, S.-F. Heating, ventilation, and air conditioning system optimization control strategy involving fan coil unit temperature control. Appl. Sci. 2019, 9, 2391.
  25. Shafighfard, T.; Kazemi, F.; Asgarkhani, N.; Yoo, D.Y. Machine-learning methods for estimating compressive strength of high-performance alkali-activated concrete. Eng. Appl. Artif. Intell. 2024, 136, 109053.
  26. Chen, C.; An, J.; Wang, C.; Duan, X.; Lu, S.; Che, H.; Qi, M.; Yan, D. Deep Reinforcement Learning-Based Joint Optimization Control of Indoor Temperature and Relative Humidity in Office Buildings. Buildings 2023, 13, 438.
  27. NOAA. National Centers for Environmental Information. Available online: http://www.noaa.gov (accessed on 8 January 2025).
Figure 1. The traditional reinforcement learning method controls part of the water flow and air flow in an FCU system.
Figure 2. Fan coil system working diagram.
Figure 3. Control flow.
Figure 4. Room model.
Figure 5. Cooling load data for room 1-N-6 in 2021.
Figure 6. Input and output simulation model of FCU.
Figure 7. Markov decision model of fan coil system.
Figure 8. FS-DDPG control method based on SRL.
Figure 9. RBC sequence decision control logic.
Figure 10. Cumulative reward.
Figure 11. Average annual power consumption comparison.
Figure 12. Average annual relative error distribution of water flow by FS-DDPG, DDPG and MBC.
Figure 13. Average annual relative error distribution of air flow by FS-DDPG, DDPG and MBC.
Figure 14. Annual relative error of water flow.
Figure 15. Annual relative error of air flow.
Figure 16. Cooling load deviation under the FS-DDPG control method.
Table 1. Basic room attributes configuration.
Room Number | Room Function | Room Area (m²) | Interior Design Temperature (°C)
1-N-1 | Lounge | 32 | 26
1-N-2 | Office | 96 | 25
1-N-3 | Office | 96 | 25
1-N-4 | Office | 32 | 25
1-N-5 | Conference Room | 96 | 26
1-N-6 | Office | 64 | 25
1-N-7 | Office | 64 | 25
1-N-8 | Office | 64 | 25
1-N-9 | Office | 64 | 25
1-N-10 | Office | 64 | 25
1-N-11 | Supply Room | 32 | No Set
1-N-12 | Hallway | 176 | No Set
Table 2. Statistical analysis of weather data from 2012 to 2022 (minus signs lost in extraction have been restored where the ordering of minimum, median and maximum makes them unambiguous).
Metric | Maximum Value | Minimum Value | Median | Mean Value | Unit
2 m air temperature | 37.26 | −7.61 | 18.59 | 17.82 | °C
Surface pressure | 1041.42 | 981.76 | 1016.41 | 1015.94 | hPa
Surface temperature | 38.38 | −8.08 | 17.98 | 17.17 | °C
Dew point temperature | 28.93 | −20.55 | 13.66 | 13.34 | °C
Relative humidity | 99.99 | 15.09 | 80.72 | 76.93 | %
East wind speed | 8.84 | −10.95 | −0.41 | 0.32 | m/s
North wind speed | 8.38 | −13.50 | −1.03 | 0.75 | m/s
Total solar irradiance | 3653.08 | 0 | 20.46 | 583.70 | J/m²
Net solar irradiance | 3096.26 | 0 | 17.71 | 499.52 | J/m²
Amount of precipitation | 14.24 | 0 | 0 | 0.18 | mm
Evaporation capacity | 0.04 | −0.76 | −0.04 | 0.11 | mm
Ultraviolet intensity | 410.87 | 0 | 2.74 | 65.62 | J/m²
Table 3. Network parameter setting.
Network | Number of Neurons | Activation Function | Learning Rate | Optimizer
Actor | 1→64 | ReLU | 0.002 | Adam
      | 64→2 | Tanh | |
Critic | [1→64], [2→64] | ReLU | 0.001 | Adam
       | 64→1 | ReLU | |
Table 4. Hyperparameter setting.
Hyperparameter | Value
Soft renewal coefficient τ | 0.01
Replay pool size | 2000
Sample batch size | 64
Discount factor γ | 0.01
Standard deviation minimum var_min | 0.15
Standard deviation decay k_var | 0.995
Table 5. DQN hyperparameter settings.
Hyperparameter | Value
Neurons in hidden layers | 32, 32
Sample batch size | 32
Discount factor γ | 0.01
Learning rate | 0.001
Replay pool size D | 2000
Update step C_step | 200
Training round Episodes | 20
Table 6. Energy consumption confidence interval (CI) analysis for different methods across 10 years (kWh).
Year | RBC | DDPG | DQN | FS-DDPG | MBC
1 | (2030.85, 2450.54) | (2743.08, 2955.51) | (3062.25, 3274.68) | (2482.97, 2890.17) | (814.68, 872.57)
2 | (2362.05, 2851.11) | (2457.47, 2691.95) | (2843.67, 3078.15) | (2258.52, 2699.19) | (812.85, 876.83)
3 | (1692.23, 2046.08) | (1832.25, 2025.76) | (2594.17, 2807.03) | (1739.33, 1916.79) | (795.48, 852.10)
4 | (1654.64, 2025.36) | (1399.21, 1618.03) | (2323.62, 2564.32) | (1113.18, 1221.68) | (778.72, 838.32)
5 | (1641.67, 2241.42) | (1008.48, 1289.48) | (1841.49, 2122.49) | (1003.07, 1144.15) | (622.35, 690.82)
6 | (1489.11, 2112.28) | (982.09, 1253.88) | (1495.38, 1794.35) | (864.03, 1024.02) | (561.55, 628.13)
7 | (2407.06, 2845.89) | (880.16, 1036.82) | (1198.10, 1485.62) | (783.36, 909.26) | (496.16, 547.21)
8 | (2015.07, 2442.22) | (1150.62, 1404.40) | (1706.29, 1985.44) | (1040.31, 1271.24) | (812.83, 865.38)
9 | (2083.30, 2510.34) | (1117.06, 1295.12) | (1590.59, 1786.45) | (964.74, 1113.13) | (828.22, 884.86)
10 | (2281.93, 2668.55) | (1104.88, 1315.05) | (1642.64, 1852.81) | (925.97, 1034.22) | (851.87, 902.40)
Table 7. Annual water and air flow action consistency (AC) analysis.
Year | Water Flow AC (DDPG) | Water Flow AC (FS-DDPG) | Air Flow AC (DDPG) | Air Flow AC (FS-DDPG)
1 | 45.00% | 46.13% | 39.87% | 41.60%
2 | 45.66% | 46.03% | 42.95% | 43.74%
3 | 48.93% | 52.52% | 44.12% | 48.09%
4 | 57.10% | 71.43% | 57.11% | 59.76%
5 | 73.72% | 83.99% | 73.43% | 74.79%
6 | 84.78% | 86.41% | 84.46% | 87.44%
7 | 83.66% | 86.74% | 85.65% | 90.48%
8 | 82.77% | 85.15% | 83.97% | 84.78%
9 | 83.34% | 88.70% | 85.03% | 88.71%
10 | 83.74% | 89.96% | 86.59% | 89.23%
Table 8. Comparison of DDPG and FS-DDPG in the fluctuation of water flow and air flow (number of fluctuating actions with G > 1).
Year | Water Flow (DDPG) | Water Flow (FS-DDPG) | Reduced Proportion | Air Flow (DDPG) | Air Flow (FS-DDPG) | Reduced Proportion
1 | 1038 | 496 | 52.22% | 1092 | 525 | 52.04%
2 | 1136 | 487 | 57.13% | 1095 | 571 | 47.89%
3 | 1184 | 539 | 54.48% | 1135 | 551 | 51.47%
4 | 735 | 447 | 39.18% | 882 | 483 | 45.29%
5 | 472 | 172 | 63.56% | 530 | 237 | 55.66%
6 | 359 | 61 | 83.01% | 464 | 133 | 71.32%
7 | 366 | 57 | 84.43% | 363 | 79 | 78.25%
8 | 640 | 9 | 98.59% | 712 | 28 | 96.07%
9 | 595 | 4 | 99.33% | 677 | 18 | 97.34%
10 | 430 | 11 | 97.44% | 538 | 26 | 95.17%
11 (Test Set) | 557 | 10 | 98.20% | 645 | 27 | 95.82%
Table 9. The variation trend of the annual satisfaction rate (SR) of the five methods.
Year | DDPG | FS-DDPG | RBC | MBC | DQN
1 | 15.51% | 17.13% | 58.19% | 92.89% | 12.84%
2 | 26.72% | 25.02% | 59.34% | 93.45% | 22.83%
3 | 37.67% | 38.52% | 60.04% | 93.24% | 28.34%
4 | 46.94% | 49.43% | 59.48% | 92.96% | 39.12%
5 | 65.79% | 70.99% | 59.12% | 92.67% | 53.64%
6 | 74.78% | 77.41% | 59.79% | 94.11% | 59.79%
7 | 78.66% | 83.74% | 60.51% | 92.82% | 68.51%
8 | 84.17% | 85.15% | 58.96% | 93.05% | 71.96%
9 | 83.34% | 83.16% | 59.22% | 94.25% | 70.22%
10 | 84.74% | 84.706% | 60.16% | 92.74% | 72.16%
11 (Test Set) | 84.18% | 84.33% | 59.77% | 93.46% | 69.77%
Table 10. Performance comparison of FS-DDPG with different penalty term weights and constraint tightening thresholds.
Penalty Term Weight λ | Constraint Tightening Threshold β | Satisfaction Rate (SR) | Energy Consumption | Fluctuation Reduction Ratio vs. DDPG (Water Flow) | Fluctuation Reduction Ratio vs. DDPG (Air Flow)
10.0 | 1.0 | 84.33% | 1080.75 kWh | 98.20% | 95.82%
5.0 | 1.0 | 84.22% | 1157.91 kWh | 84.18% | 76.69%
15.0 | 1.0 | 82.35% | 1201.93 kWh | 99.57% | 98.81%
10.0 | 0.5 | 82.17% | 1278.47 kWh | 99.83% | 99.15%
10.0 | 2.0 | 83.94% | 1137.62 kWh | 65.72% | 59.09%
