**1. Background and Motivation**

Along with a comfortable environment, air-conditioning brings a problem of energy consumption. According to the U.S. Energy Information Administration, the energy consumption of residential and commercial buildings accounts for 20.1% of global energy consumption. Of this amount, the buildings' heating, ventilation, and air conditioning (HVAC) systems can account for up to 50% of total building energy demand [1,2]. To minimize overall system energy input or operating cost while still maintaining a satisfying indoor thermal comfort and a healthy environment, a variety of control algorithms have been applied to HVAC systems. In [3], model-based monitoring of the occupants' thermal state by predictive control was proposed. In [4], supervisory and optimal controls were summarized and classified into model-based methods, which use physical models, gray-box models, and black-box models, and the model-free category, which uses expert systems and pure learning approaches. In [5], the existing methods were grouped into hard control (HC), covering basic controls such as PID control, optimal control, nonlinear control, robust or H∞ control, and adaptive control, and soft control (SC), covering neural networks (NNs), fuzzy logic (FL), genetic algorithms (GAs), and other evolutionary methods. Moreover, hybrid control resulting from the fusion of SC and HC may achieve better performance.

Most of today's air-conditioning systems work on the principle of compression refrigeration, in which the refrigeration cycle is the core of the air-conditioning system [6]. According to refrigeration theory, air-conditioning design uses the theoretical optimal conversion efficiency for the specified loads. In practice, however, performance degradation is inevitable because operating conditions deviate from design requirements. For example, a public place needs only a small refrigerating output for the light load at night, but a large refrigerating output for the varying flow of people during the day. Under such time-varying and uncertain conditions, a conventional control algorithm leads to low operational conversion efficiency because, as is well known, the efficiency of compression refrigeration is tied to the working point and affected by the loads.

With the development of modern electronic and measurement technologies, such as supervisory control and data acquisition (SCADA) and smart sensors, a refrigeration system can readily collect abundant data describing its current refrigeration states and operational environment. Data-driven methods are therefore promising for improving the conversion efficiency by using these data directly. Reinforcement learning (RL), a well-known data-driven method regarded as a model-free supervisory control method in [5] and a soft control (SC) in [6], is about learning from interaction how to behave in order to achieve a goal. The basic idea of RL is to capture the most important aspects of an agent: the sensation, the action, and the goal [7]. A reward rule is employed to encourage the agent to seek the goal by adjusting its actions through exploration and exploitation. By trial and error, the agent eventually acquires the optimal action for its surroundings. In this way, RL can achieve optimal actions based on the current states and environment. Along with its remarkable success in games [8], RL has attracted great interest in various industries, such as robotics [9,10], fault detection [11], and fault-tolerant control [12,13].

RL has been introduced to HVAC, and some excellent results have been published recently. Baeksuk Chu et al. proposed an actor-critic architecture and natural gradient method, with an efficiency improvement by the recursive least-squares (RLS) method, aiming at two objectives: maintaining an adequate level of pollutants and minimizing power consumption [14]. Pedro Fazenda et al. followed the idea of multi-objective optimal supervisory control and studied the application of a discrete and continuous reinforcement-learning-based supervisory control approach, which actively learned how to appropriately schedule thermostat temperature set-points [15]. In [16], a multi-grid method of Q-learning was addressed to handle the problem that online reinforcement learning often suffers from slow convergence and faults in early stages. In [17], a learning architecture was proposed for an intelligent thermostat that comprised a number of different learning methods, each contributing to a complete autonomous thermostat capable of controlling an HVAC system, together with a novel state-action space formalism to enable RL to succeed. In [18], a ventilated facade with PCM was controlled using a reinforcement learning algorithm. In [19], a reinforcement learning control (RLC) approach was presented for optimal control of low-exergy buildings. An improved reinforcement learning controller was designed in [20] to obtain an optimal control strategy for blinds and lights, providing a personalized service by introducing subjects' perceptions of their surroundings, gathered by a novel interface, as the feedback signal. In [21], a model-free actor-critic reinforcement learning (RL) controller was addressed using a variant of artificial recurrent neural networks called long short-term memory (LSTM) networks.
A model-free Q-learning method was addressed in [22] that made optimal control decisions for HVAC and window systems to minimize both energy consumption and thermal discomfort. Recently, the latest RL methods, for example, deep reinforcement learning control [23,24], the regularized fitted Q-iteration approach [25], proximal actor-critic [26], and auto-encoders [27,28], have appeared in air-conditioning to minimize both energy consumption and thermal discomfort. However, none of these studies addresses time-varying and uncertain loads.

RL is classified into online learning and offline learning according to the source of its training data [29]. Online learning makes sense for the refrigeration system because it can cope with time-varying and uncertain loads. There are two challenges: (1) The new states are obtained only after the actions have acted on the system; this is a risk for the system because no one knows the results of these actions in advance. (2) Conventional RL trains by trial and error, a brute-force way to probe an unknown environment that is inefficient because it exploits little information linking states and actions. In this study, a coarse model is proposed to help solve both problems. We consider a coarse model based on the fact that the principles of refrigeration conversion are well understood. It is possible to build a coarse model that captures the relations of the state variables according to these principles, based on figures and charts derived from a large number of experiences. On the one hand, it brings more information linking states and actions to speed up the learning process. On the other hand, it forecasts the effects of actions, like a simulation. We give up an accurate model because of the system's complexity and the surrounding disturbances. In addition, RL only needs the reward as a basis for action adjustment in the process of seeking the goal. The reward can be obtained from the tendency of the performance indicator, which relaxes the requirement on model accuracy. As a result, a coarse model is sufficient for RL. Based on a coarse model, computer simulation can be used to estimate the next states instead of the real system, which reduces the risk of unknown states after a trial action.

Motivated by the above discussions, a novel approach combining reinforcement learning with a coarse model is proposed to improve the conversion efficiency of refrigeration: we take the conversion efficiency as the goal of reinforcement learning, the measurement data of the refrigeration system as the sensation of the agent, and the control valve of refrigeration as the action of the agent, and learn by the Q-algorithm. The advantages are summarized as follows:


The remaining parts of this paper are organized as follows: The aim of the study and the preliminaries on a compression refrigerating system are introduced in Section 2. The fundamentals of reinforcement learning algorithms are reviewed in Section 3. Based on Sections 2 and 3, a reinforcement learning control algorithm for compression refrigerating systems is proposed in Section 4, and case studies are addressed in Section 5. The paper ends with conclusions in Section 6.

#### **2. Preliminaries**

#### *2.1. Aim of the Research*

In a compression refrigerating system, the compressor drives the refrigerant in a circulatory flow, transferring heat from the low-temperature medium to the high-temperature medium via a phase change of the refrigerant. In this process, temperature is the key indicator of efficiency, being a comprehensive result of the interaction of mass, pressure, process, and loads, so the coefficient of performance (COP) at time k is defined as

$$\text{COP}(k) = \frac{T_c(k)}{T_c(k) - T_e(k)} \tag{1}$$

where Te and Tc are the evaporation temperature and the condensation temperature at sampling instant k, respectively. Our aim is to maximize the coefficient of performance of the refrigeration cycle over a period of time *l*:

$$\text{COP} = \max \left\{ \sum_{k=1}^{l} \frac{T_c(k)}{T_c(k) - T_e(k)} \right\} \tag{2}$$
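The definitions in formulas (1) and (2) can be sketched directly in code; the helpers below are an illustrative reading of those definitions (the function and variable names are ours, and the temperatures are assumed to be in the same units the paper uses):

```python
def cop(t_c, t_e):
    """Instantaneous coefficient of performance, formula (1).

    t_c: condensation temperature, t_e: evaporation temperature,
    both taken at the same sampling instant k.
    """
    return t_c / (t_c - t_e)


def mean_cop(t_c_series, t_e_series):
    """Average COP over a horizon of l samples, i.e., the sum
    maximized in formula (2) divided by the horizon length."""
    terms = [cop(tc, te) for tc, te in zip(t_c_series, t_e_series)]
    return sum(terms) / len(terms)
```

For instance, with Tc = 40 and Te = 7 the instantaneous COP is 40/33 ≈ 1.21, on the order of the average values reported in Section 5.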

#### *2.2. Principle of Compression Refrigerating System*

A compression refrigerating system is made up of four elements: the compressor, the evaporator, the condenser, and the throttle mechanism. The compressor provides power for the circulation of the refrigerant. The evaporator implements the heat exchange between the refrigerant and the chilled water. The condenser completes the heat exchange between the refrigerant and the cooling water or the air-cooling system. The throttle mechanism (mainly an electronic expansion valve) adjusts the mass flow rate of the refrigerant in the circulation. Here, the principle of each element is reviewed for completeness according to [30–33]; more details can be found in [30–33] and the references therein.

#### 2.2.1. Principle of the Evaporator

The scheme of the evaporator process is depicted in Figure 1.

**Figure 1.** The scheme of the two-phase zone.

The inner region of the evaporator is divided into a two-phase zone, where the refrigerant exists as a combination of liquid and gas, and a superheated zone, where the liquid refrigerant evaporates into gas by absorbing heat and finally becomes superheated steam. The refrigerant moves from the two-phase zone to the superheated zone. An average void fraction is used to express the proportion of gas in the two-phase zone; it is defined as the integral of the void fraction along the pipeline direction

$$\overline{\gamma} = \frac{1}{x_2 - x_1} \int_{x_1}^{x_2} \gamma(x)\,\mathrm{d}x \tag{3}$$

where γ is the void fraction correlation given by formula (4)

$$\gamma = \frac{1}{1 + \left(\frac{1-x}{x}\right)\left(\frac{\rho_g}{\rho_l}\right) S} \tag{4}$$

in which x is the vapor quality, ρ<sub>g</sub> and ρ<sub>l</sub> are the densities of saturated steam and saturated liquid, respectively (kg/m³), and S is the slip ratio.
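Formulas (3) and (4) can be evaluated numerically. The sketch below is illustrative (all names are ours): it computes the local void fraction of formula (4) and approximates the average of formula (3) with the trapezoidal rule, under the simplifying assumption that the vapor quality varies linearly over [x1, x2] along the tube:

```python
def void_fraction(x, rho_g, rho_l, S):
    """Local void fraction gamma(x), formula (4).
    x: vapor quality, rho_g/rho_l: saturated vapor/liquid
    densities (kg/m^3), S: slip ratio."""
    return 1.0 / (1.0 + ((1.0 - x) / x) * (rho_g / rho_l) * S)


def mean_void_fraction(x1, x2, rho_g, rho_l, S, n=1000):
    """Average void fraction over [x1, x2], formula (3),
    approximated with the composite trapezoidal rule."""
    xs = [x1 + (x2 - x1) * i / n for i in range(n + 1)]
    ys = [void_fraction(x, rho_g, rho_l, S) for x in xs]
    h = (x2 - x1) / n
    integral = h * (ys[0] / 2 + sum(ys[1:-1]) + ys[-1] / 2)
    return integral / (x2 - x1)
```

As a sanity check, with equal densities and S = 1, formula (4) reduces to γ(x) = x, so the average over [0.2, 0.8] is 0.5.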

According to the laws of mass, momentum, and energy conservation, one obtains the state equation of the evaporator as formula (5)

$$E(x_e)\,\dot{x}_e = f(x_e, u_e) \tag{5}$$

with

$$
\begin{bmatrix}
E_{11} & E_{12} & 0 & 0 & 0 \\
E_{21} & E_{22} & E_{23} & 0 & 0 \\
E_{31} & E_{32} & E_{33} & 0 & 0 \\
0 & 0 & 0 & E_{44} & 0 \\
E_{51} & 0 & 0 & 0 & E_{55}
\end{bmatrix}
\begin{bmatrix}
\dot{L}_{e1} \\ \dot{P}_e \\ \dot{h}_{eo} \\ \dot{T}_{ew1} \\ \dot{T}_{ew2}
\end{bmatrix}
=
\begin{bmatrix}
\dot{m}_{ei}(h_{ei} - h_g) + \alpha_{ei1} A_{ei} \frac{L_{e1}}{L_{etotal}} (T_{ew1} - T_{er1}) \\
\dot{m}_{eo}(h_g - h_{eo}) + \alpha_{ei2} A_{ei} \frac{L_{e2}}{L_{etotal}} (T_{ew2} - T_{er2}) \\
\dot{m}_{ei} - \dot{m}_{eo} \\
\alpha_{ei1} A_{ei}(T_{er1} - T_{ew1}) + \alpha_{eo} A_{eo}(T_{chw} - T_{ew1}) \\
\alpha_{ei2} A_{ei}(T_{er2} - T_{ew2}) + \alpha_{eo} A_{eo}(T_{chw} - T_{ew2})
\end{bmatrix}
$$

where

$$\begin{aligned}
E_{11} &= \rho_l (h_l - h_g)(1 - \overline{\gamma}) A_{ecs} \\
E_{12} &= \left[\left(\frac{\mathrm{d}\rho_l h_l}{\mathrm{d}P_e} - \frac{\mathrm{d}\rho_l}{\mathrm{d}P_e} h_g\right)(1 - \overline{\gamma}) + \left(\frac{\mathrm{d}\rho_g h_g}{\mathrm{d}P_e} - \frac{\mathrm{d}\rho_g}{\mathrm{d}P_e} h_g\right)(\overline{\gamma} - 1)\right] A_{ecs} L_{e1} \\
E_{21} &= \rho_{e2}(h_g - h_{e2}) A_{ecs} \\
E_{22} &= \left[\left(\left.\frac{\partial\rho_{e2}}{\partial P_e}\right|_{h_{e2}} - \frac{1}{2}\left(\left.\frac{\partial\rho_{e2}}{\partial h_{e2}}\right|_{P_e}\right)\frac{\mathrm{d}h_g}{\mathrm{d}P_e}\right)(h_{e2} - h_g) + \frac{1}{2}\frac{\mathrm{d}h_g}{\mathrm{d}P_e}\rho_{e2} - 1\right] A_{ecs} L_{e2} \\
E_{23} &= \left[\frac{1}{2}\left(\left.\frac{\partial\rho_{e2}}{\partial h_{e2}}\right|_{P_e}\right)(h_{e2} - h_g) + \frac{\rho_{e2}}{2}\right] A_{ecs} L_{e2} \\
E_{31} &= \left[\rho_g - \rho_{e2} + (\rho_l - \rho_g)(1 - \overline{\gamma})\right] A_{ecs} \\
E_{32} &= \left[\left(\left.\frac{\partial\rho_{e2}}{\partial P_e}\right|_{h_{e2}} + \frac{1}{2}\left(\left.\frac{\partial\rho_{e2}}{\partial h_{e2}}\right|_{P_e}\right)\frac{\mathrm{d}h_g}{\mathrm{d}P_e}\right) L_{e2} + \left(\frac{\mathrm{d}\rho_l}{\mathrm{d}P_e}(1 - \overline{\gamma}) + \frac{\mathrm{d}\rho_g}{\mathrm{d}P_e}\overline{\gamma}\right) L_{e1}\right] A_{ecs} \\
E_{33} &= \frac{1}{2}\left(\left.\frac{\partial\rho_{e2}}{\partial h_{e2}}\right|_{P_e}\right) A_{ecs} L_{e2}, \quad
E_{44} = (C_p\rho V)_{ew}, \quad
E_{51} = -(C_p\rho V)_{ew}\frac{T_{ew2} - T_{ew1}}{L_{e2}}, \quad
E_{55} = (C_p\rho V)_{ew}
\end{aligned}$$

The meanings of the states and parameters are given in Table 1.



#### 2.2.2. Principle of the Condenser

The condenser inner region is divided into a two-phase zone, a superheated zone, and a supercooled zone. For each zone, partial differential equations have been built according to the laws of conservation of mass, momentum, and energy, based on the properties of refrigerants and empirical equations. The state equation of the condenser, obtained by integrating along the pipeline direction, is given as follows:

$$C(x_c)\,\dot{x}_c = f(x_c, u_c) \tag{6}$$

$$
\begin{bmatrix}
C_{11} & 0 & C_{13} & 0 & 0 & 0 & 0 \\
C_{21} & C_{22} & C_{23} & C_{24} & 0 & 0 & 0 \\
C_{31} & C_{32} & C_{33} & C_{34} & 0 & 0 & 0 \\
C_{41} & C_{42} & C_{43} & C_{44} & 0 & 0 & 0 \\
C_{51} & 0 & 0 & 0 & C_{55} & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & C_{66} & 0 \\
C_{71} & 0 & 0 & 0 & 0 & 0 & C_{77}
\end{bmatrix}
\begin{bmatrix}
\dot{L}_{c1} \\ \dot{L}_{c2} \\ \dot{P}_c \\ \dot{h}_{co} \\ \dot{T}_{cw1} \\ \dot{T}_{cw2} \\ \dot{T}_{cw3}
\end{bmatrix}
=
\begin{bmatrix}
\dot{m}_{ci}(h_{ci} - h_g) + \alpha_{ci1} A_{ci} \frac{L_{c1}}{L_{ctotal}} (T_{cw1} - T_{cr1}) \\
\dot{m}_{ci} h_g - \dot{m}_{co} h_l + \alpha_{ci2} A_{ci} \frac{L_{c2}}{L_{ctotal}} (T_{cw2} - T_{cr2}) \\
\dot{m}_{co}(h_l - h_{co}) + \alpha_{ci3} A_{ci} \frac{L_{c3}}{L_{ctotal}} (T_{cw3} - T_{cr3}) \\
\dot{m}_{ci} - \dot{m}_{co} \\
\alpha_{ci1} A_{ci}(T_{cr1} - T_{cw1}) + \alpha_{co} A_{co}(T_a - T_{cw1}) \\
\alpha_{ci2} A_{ci}(T_{cr2} - T_{cw2}) + \alpha_{co} A_{co}(T_a - T_{cw2}) \\
\alpha_{ci3} A_{ci}(T_{cr3} - T_{cw3}) + \alpha_{co} A_{co}(T_a - T_{cw3})
\end{bmatrix}
$$

where

$$\begin{aligned}
C_{11} &= \rho_{c1}(h_{c1} - h_g) A_{ccs} \\
C_{13} &= \left[\left(\left.\frac{\partial\rho_{c1}}{\partial P_c}\right|_{h_{c1}} + \frac{1}{2}\frac{\mathrm{d}h_g}{\mathrm{d}P_c}\left(\left.\frac{\partial\rho_{c1}}{\partial h_{c1}}\right|_{P_c}\right)\right)(h_{c1} - h_g) + \rho_{c1} - 1\right] A_{ccs} L_{c1} \\
C_{21} &= (\rho_{c1} h_g - \rho_{c3} h_l) A_{ccs}, \quad
C_{22} = \left[(\rho_g h_g - \rho_l h_l)\overline{\gamma} + (\rho_l - \rho_{c3}) h_l\right] A_{ccs} \\
C_{23} &= \left(\left.\frac{\partial\rho_{c1}}{\partial P_c}\right|_{h_{c1}} + \frac{1}{2}\left(\left.\frac{\partial\rho_{c1}}{\partial h_{c1}}\right|_{P_c}\right)\frac{\mathrm{d}h_g}{\mathrm{d}P_c}\right) h_g A_{ccs} L_{c1} + \left(\frac{\mathrm{d}\rho_l h_l}{\mathrm{d}P_c}(1 - \overline{\gamma}) + \frac{\mathrm{d}\rho_g h_g}{\mathrm{d}P_c}(\overline{\gamma} - 1)\right) A_{ccs} L_{c2} \\
&\quad + \left(\left.\frac{\partial\rho_{c3}}{\partial P_c}\right|_{h_{c3}} + \frac{1}{2}\left(\left.\frac{\partial\rho_{c3}}{\partial h_{c3}}\right|_{P_c}\right)\frac{\mathrm{d}h_l}{\mathrm{d}P_c}\right) h_l A_{ccs} L_{c3} \\
C_{24} &= \frac{1}{2}\left(\left.\frac{\partial\rho_{c3}}{\partial h_{c3}}\right|_{P_c}\right) h_l A_{ccs} L_{c3}, \quad
C_{31} = \rho_{c3}(h_l - h_{c3}) A_{ccs}, \quad
C_{32} = \rho_{c3}(h_l - h_{c3}) A_{ccs} \\
C_{33} &= \left[\left(\left.\frac{\partial\rho_{c3}}{\partial P_c}\right|_{h_{c3}} + \frac{1}{2}\frac{\mathrm{d}h_l}{\mathrm{d}P_c}\left(\left.\frac{\partial\rho_{c3}}{\partial h_{c3}}\right|_{P_c}\right)\right)(h_{c3} - h_l) + \frac{1}{2}\frac{\mathrm{d}h_l}{\mathrm{d}P_c}\rho_{c3} - 1\right] A_{ccs} L_{c3} \\
C_{34} &= \left[\frac{1}{2}\left(\left.\frac{\partial\rho_{c3}}{\partial h_{c3}}\right|_{P_c}\right)(h_{c3} - h_l) + \frac{1}{2}\rho_{c3}\right] A_{ccs} L_{c3} \\
C_{41} &= (\rho_{c1} - \rho_{c3}) A_{ccs}, \quad
C_{42} = \left[(\rho_g - \rho_l)\overline{\gamma} + (\rho_l - \rho_{c3})\right] A_{ccs} \\
C_{43} &= \left(\left.\frac{\partial\rho_{c1}}{\partial P_c}\right|_{h_{c1}} + \frac{1}{2}\left(\left.\frac{\partial\rho_{c1}}{\partial h_{c1}}\right|_{P_c}\right)\frac{\mathrm{d}h_g}{\mathrm{d}P_c}\right) A_{ccs} L_{c1} + \left(\frac{\mathrm{d}\rho_l}{\mathrm{d}P_c}(1 - \overline{\gamma}) + \frac{\mathrm{d}\rho_g}{\mathrm{d}P_c}\overline{\gamma}\right) A_{ccs} L_{c2} \\
&\quad + \left(\left.\frac{\partial\rho_{c3}}{\partial P_c}\right|_{h_{c3}} + \frac{1}{2}\left(\left.\frac{\partial\rho_{c3}}{\partial h_{c3}}\right|_{P_c}\right)\frac{\mathrm{d}h_l}{\mathrm{d}P_c}\right) A_{ccs} L_{c3} \\
C_{44} &= \frac{1}{2}\left(\left.\frac{\partial\rho_{c3}}{\partial h_{c3}}\right|_{P_c}\right) A_{ccs} L_{c3}, \quad
C_{51} = (C_p\rho V)_{cw}\frac{T_{cw2} - T_{cw1}}{L_{c1}}, \quad
C_{55} = (C_p\rho V)_{cw} \\
C_{66} &= (C_p\rho V)_{cw}, \quad
C_{71} = (C_p\rho V)_{cw}\frac{T_{cw2} - T_{cw3}}{L_{c3}}, \quad
C_{77} = (C_p\rho V)_{cw}
\end{aligned}$$

The meanings of the states and parameters are provided in Table 1.

#### 2.2.3. Principle of the Compressor

The refrigerant mass flow rate of the compressor outlet is

$$\dot{m}_{com} = f\, \eta_{vol} V_{com} \rho_g \tag{7}$$


where ṁ<sub>com</sub> is the refrigerant mass flow rate at the compressor outlet (kg/s), f is the working frequency (Hz), V<sub>com</sub> is the theoretical gas delivery determined by the producer (m³/s), ρ<sub>g</sub> is the density of the refrigerant at the compressor entrance (kg/m³), and η<sub>vol</sub> is the volumetric efficiency of the compressor, which follows formula (8)

$$\eta_{vol} = 0.98 - 0.085\left[\left(\frac{P_c}{P_e}\right)^{\frac{1}{k}} - 1\right] \tag{8}$$

where Pc and Pe are the condensation and evaporation pressures, respectively, and k is the polytropic exponent.
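Formulas (7) and (8) combine into a direct calculation of the compressor mass flow. The sketch below is a straightforward transcription (function names and the example values in the usage note are ours, not manufacturer data):

```python
def volumetric_efficiency(p_c, p_e, k):
    """Volumetric efficiency of the compressor, formula (8).
    p_c, p_e: condensation and evaporation pressures (same units),
    k: polytropic exponent."""
    return 0.98 - 0.085 * ((p_c / p_e) ** (1.0 / k) - 1.0)


def compressor_mass_flow(f, v_com, rho_g, p_c, p_e, k):
    """Refrigerant mass flow at the compressor outlet, formula (7).
    f: working frequency (Hz), v_com: theoretical gas delivery
    (m^3/s), rho_g: suction-side vapor density (kg/m^3)."""
    return f * volumetric_efficiency(p_c, p_e, k) * v_com * rho_g
```

Note that when Pc = Pe the pressure-ratio term vanishes and the efficiency reduces to the nominal 0.98.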

#### 2.2.4. Principle of the Electronic Expansion Valve

The mass flow of the electronic expansion valve is

$$\dot{m}_e = C_v k \sqrt{\rho_l (P_c - P_e)} \tag{9}$$

where Cv and k are the flow coefficient and the opening of the expansion valve, respectively, and ρ<sub>l</sub> is the density of the refrigerant at the entrance of the expansion valve (kg/m³).
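Formula (9) is likewise a one-line calculation; the sketch below (names are ours) makes the unit assumptions explicit:

```python
import math


def valve_mass_flow(c_v, opening, rho_l, p_c, p_e):
    """Mass flow through the electronic expansion valve, formula (9).
    c_v: flow coefficient, opening: valve opening k,
    rho_l: liquid density at the valve inlet (kg/m^3),
    p_c, p_e: condensation and evaporation pressures (Pa)."""
    return c_v * opening * math.sqrt(rho_l * (p_c - p_e))
```

The flow grows with the square root of the pressure difference across the valve, which is why the valve opening and the compressor frequency together determine the circulating mass flows used as actions later in Section 4.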

#### **3. Reinforcement Learning**

#### *3.1. Reinforcement Learning*

Reinforcement learning has received much attention during the past decades [34,35]; its basics are reviewed as follows. Consider the Markov decision process MDP (X, A, P, R), where X is a set of states and A is a set of actions or controls. The transition probabilities P : X × A × X → [0, 1] give, for each state x ∈ X and action a ∈ A, the conditional probability P(x(k + 1), x(k), a(k)) = Pr{x(k + 1) | x(k), a(k)} of transitioning to state x(k + 1) ∈ X given that the MDP is in state x(k) and takes action a(k). The cost function R : X × A × X → ℝ is the expected immediate cost R<sub>k</sub>(x(k + 1), x(k), a(k)) paid after the transition to state x(k + 1) ∈ X, given that the MDP starts in state x(k) ∈ X and takes action a(k) ∈ A. The value of a policy V<sup>π</sup><sub>k</sub>(x(k)) is defined as the conditional expected value of the future cost, E<sub>π</sub>{Σ<sub>i=k</sub><sup>k+T</sup> γ<sup>i−k</sup>R<sub>i</sub>}, R<sub>i</sub> ∈ R, when starting in state x(k) at time k and following policy π(x, a). The target is to find the optimal actions that maximize the value of the future cost V<sup>π</sup><sub>k</sub>(x):

$$\begin{aligned}
V_k^{\pi}(x) &= \mathbb{E}_{\pi}\left\{\sum_{i=k}^{k+T} \gamma^{i-k} R_i\right\} \\
&= \sum_{a} \pi(x, a) \sum_{x(k+1)} P(x(k+1), x(k), a(k))\Big[R_k(x(k+1), x(k), a(k)) + \gamma \mathbb{E}_{\pi}\Big\{\sum_{i=k+1}^{k+T} \gamma^{i-(k+1)} R_i\Big\}\Big] \\
&= \sum_{a} \pi(x, a) \sum_{x(k+1)} P(x(k+1), x(k), a(k))\left[R_k(x(k+1), x(k), a(k)) + \gamma V_{k+1}^{\pi}(x(k+1))\right]
\end{aligned} \tag{10}$$

With T = ∞, the value function V<sup>π</sup><sub>k</sub>(x) for the policy π(x, a) satisfies the Bellman equation:

$$\begin{aligned} \mathbf{V}\_{\mathbf{k}}^{\pi}(\mathbf{x}) &= \sum\_{\mathbf{a}} \pi(\mathbf{x}, \mathbf{a}) \sum\_{\mathbf{x}(\mathbf{k}+1)} \mathbf{P}(\mathbf{x}(\mathbf{k}+1), \mathbf{x}(\mathbf{k}), \mathbf{a}(\mathbf{k})) [\mathbf{R}\_{\mathbf{k}}(\mathbf{x}(\mathbf{k}+1), \mathbf{x}(\mathbf{k}), \mathbf{a}(\mathbf{k})) \\ &+ \gamma \mathbf{V}\_{\mathbf{k}+1}^{\pi}(\mathbf{x}(\mathbf{k}+1))] \end{aligned} \tag{11}$$

For a deterministic system, Σ<sub>a</sub> π<sub>k</sub>(x, a) Σ<sub>x(k+1)</sub> P(x(k + 1), x(k), a(k)) = 1, so the optimal actions can be obtained by alternating policy evaluation (12) and policy improvement (13) according to the following two equations:

$$\mathbf{V}\_{\mathbf{k}}(\mathbf{x}) = \mathbf{R}\_{\mathbf{k}}(\mathbf{x}(\mathbf{k}+1), \mathbf{x}(\mathbf{k}), \mathbf{a}(\mathbf{k})) + \gamma \mathbf{V}\_{\mathbf{k}}(\mathbf{x}(\mathbf{k}+1)) \tag{12}$$

*Processes* **2019**, *7*, 967

$$\pi_k(x, a) = \underset{a}{\operatorname{argmax}}\left[R_k(x(k+1), x(k), a(k)) + \gamma V_k(x(k+1))\right] \tag{13}$$

where γ is a discount factor with 0 ≤ γ < 1 to ensure convergence.
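The alternation of policy evaluation (12) and policy improvement (13) can be illustrated on a toy deterministic MDP. The transition and reward tables below are invented purely for illustration and have nothing to do with the refrigeration system:

```python
# Toy deterministic MDP: next_state[s][a] and reward[s][a].
GAMMA = 0.9
states = [0, 1]
actions = [0, 1]
next_state = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}
reward = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}

V = {s: 0.0 for s in states}
policy = {s: 0 for s in states}
for _ in range(200):
    # Policy evaluation, formula (12): iterate the fixed point
    # V(x) = R + gamma * V(x') under the current policy.
    for _ in range(50):
        V = {s: reward[s][policy[s]] + GAMMA * V[next_state[s][policy[s]]]
             for s in states}
    # Policy improvement, formula (13): greedy action per state.
    policy = {s: max(actions,
                     key=lambda a: reward[s][a] + GAMMA * V[next_state[s][a]])
              for s in states}
```

On this toy problem the iteration settles on always taking action 1, with V(1) = 2/(1 − γ) = 20 and V(0) = 1 + γ·20 = 19, matching the closed-form solution.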

#### *3.2. Q-Algorithm*

In fact, formulas (12) and (13) cannot be solved directly because they need the information of V<sub>k</sub>(x(k + 1)), which no one knows. The Q-algorithm proposed by Watkins [34] provides an effective solution by substituting the function Q. It defines the evaluation function Q(x(k), a(k)) as the maximum discounted cumulative reward that can be achieved starting from state x(k) with a(k) as the first action:

$$\mathbf{Q}(\mathbf{x}(\mathbf{k}), \mathbf{a}(\mathbf{k})) \stackrel{\text{def}}{=} \mathbf{R}\_{\mathbf{k}}(\mathbf{x}(\mathbf{k}), \mathbf{a}(\mathbf{k})) + \mathbf{V}^\*(\mathbf{x}(\mathbf{k} + 1)) \tag{14}$$

where the state x(k + 1) results from x(k) and a(k), and V<sup>∗</sup>(x(k + 1)) is the optimum of V<sub>k</sub>(x(k + 1)) starting from x(k + 1) and following the optimal actions.

Denote the optimum of Q as Q<sup>∗</sup>; then one has

$$Q^*(x(k), a(k)) = \max_{a(k)}\left[R_k(x(k), a(k)) + V^*(x(k+1))\right] = R^*(x(k), a(k)) + V^*(x(k+1)) = V^*(x(k), a(k)) \tag{15}$$

where the superscript ∗ expresses the optimal values.

It is seen from formula (15) that Q<sup>∗</sup>(x(k), a(k)) is equivalent to V<sup>∗</sup>(x(k), a(k)) under the same action. Therefore, the optimal actions a<sup>∗</sup>(k) can be obtained by the value iteration of the following formula:

$$Q(x(k), a(k)) \leftarrow Q(x(k), a(k)) + \alpha\left[R(x(k), a(k)) + \gamma \max_{a(k+1)} Q(x(k+1), a(k+1)) - Q(x(k), a(k))\right] \tag{16}$$

where α is a learning rate, 0 ≤ α < 1, and the state x(k + 1) is obtained at the next sampling time k + 1. By continuously adjusting the action, this iteration finally converges to a steady state, and one obtains the corresponding control a<sup>∗</sup>(k):

$$\mathbf{a}^\*(\mathbf{k}) = \operatorname\*{argmax}\_{\mathbf{a}} \mathbf{Q}(\mathbf{x}(\mathbf{k}), \mathbf{a}(\mathbf{k})) \tag{17}$$

A proof of the strict convergence of the Q-value function (16) was given by Watkins [35], whose theorem is restated here as Lemma 1.

Define n<sub>i</sub>(x, a) as the index of the *i*th time that action a is tried in state *x*.

**Lemma 1 [35]:** *Given bounded rewards* |*R<sub>n</sub>*| ≤ *C*, *learning rates* 0 ≤ α<sub>n</sub> < 1, *and* Σ<sub>i=1</sub><sup>∞</sup> α<sub>n<sub>i</sub>(x,a)</sub> = ∞, Σ<sub>i=1</sub><sup>∞</sup> [α<sub>n<sub>i</sub>(x,a)</sub>]² < ∞ *for all x, a, then Q<sub>n</sub>*(*x*, *a*) → *Q*<sup>∗</sup>(*x*, *a*) *as n* → ∞ *for all x, a, with probability 1.*
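The value iteration of formula (16) and the action extraction of formula (17) amount to a few lines of tabular Q-learning. The sketch below is generic: the environment interaction (how x(k + 1) and the reward are produced) is left to the caller, and all names and constants are ours:

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.9   # learning rate and discount factor
Q = defaultdict(float)    # Q[(x, a)], zero-initialized table


def q_update(x, a, r, x_next, actions):
    """One application of formula (16)."""
    best_next = max(Q[(x_next, b)] for b in actions)
    Q[(x, a)] += ALPHA * (r + GAMMA * best_next - Q[(x, a)])


def greedy_action(x, actions):
    """Formula (17): a*(k) = argmax_a Q(x(k), a)."""
    return max(actions, key=lambda a: Q[(x, a)])
```

On a toy self-loop with reward 1 for one action and 0 for the other, repeated updates converge to the fixed point Q = 1/(1 − γ) = 10 for the rewarding action, as Lemma 1 guarantees.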

#### **4. The Proposed Approach**

The scheme of the proposed approach is indicated in Figure 2. It is a hierarchical control integrating a process level and a loop level. The process level targets the conversion efficiency of the refrigerating system by adjusting the action variables, whose optimum a<sup>∗</sup> is the reference for the loop level. The loop level, as a basic control circuit, implements the reference control by adjusting the control variables u, whose elements are the compressor frequency and the electronic expansion valve opening, respectively.

**Figure 2.** The scheme of the proposed approach.

#### *4.1. Process Level*

#### 4.1.1. States Vector, Action Vector, and Cost Function

As discussed in Section 2.2, there are seven states in condensation and five states in evaporation, with a complex description and high-order nonlinearity. Some hypotheses are adopted for the sake of simplicity. (1) The evaporator and the condenser are each regarded as a two-phase zone, with the degree of superheat in the evaporator estimated by local energy conservation, because by experience most of the heat-exchange area belongs to the two-phase zone. (2) Only the temperature of the heat exchange changes in the thermodynamic equation, and only the length of the two-phase zone changes in the mass conservation equation. (3) The void fraction is considered constant, based on the fact that the working condition changes little. (4) The heat transfer coefficients and the refrigerant physical parameters, which can be obtained by identification techniques, are supposed to be constant even though they differ across conditions. Based on the above hypotheses, the state vector is reduced to

$$x = \begin{bmatrix} x_1 & x_2 & x_3 & x_4 & x_5 \end{bmatrix}^T = \begin{bmatrix} T_e & l_e & T_c & l_c & T_{wa} \end{bmatrix}^T \tag{18}$$

where l<sub>e</sub> and l<sub>c</sub> are the lengths of the two-phase zones in the evaporator and the condenser (m), and T<sub>wa</sub> is the tube wall temperature of the condenser. It is important to point out that the temperature states can be measured directly by sensors, and the length states can be obtained by a soft sensor using the degree of superheat.

The action vector of the process is

$$a = \begin{bmatrix} m_e & m_{com} \end{bmatrix}^T \tag{19}$$

where m<sub>e</sub> and m<sub>com</sub> are the mass flows of the expansion valve refrigerant and of the compressor refrigerant, which are controlled by the opening of the expansion valve k and the frequency of the compressor f, according to formulas (9) and (7), respectively.

The cost function is selected as the goal under an action a(k) ∈ U

$$R_k(x(k), a(k)) = \left. \frac{T_c(k)}{T_c(k) - T_e(k)} \right|_{a(k)} \tag{20}$$

As a result, the optimal actions a∗(k) can be obtained by Q-algorithm following the formulas (16) and (17).

#### 4.1.2. Exploitation and Exploration

RL obtains the optimal action according to the reward from interaction with the environment. There are two modes: exploitation, which seeks the optimal action according to the best Q-values, and exploration, which selects actions randomly with a certain probability in order to escape the local optima of exploitation or to find a new, unknown optimum. The ε-greedy method is used to carry out exploitation and exploration in the form of formula (21)

$$a = \begin{cases} \underset{a}{\operatorname{argmax}}\, Q(x, a), & p < \varepsilon \\ \operatorname{rand}(U), & p \ge \varepsilon \end{cases} \tag{21}$$

where ε is a pre-set probability (usually large, so that exploitation is the high-probability event), p is a random number drawn at each action selection, and U is the action space.
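Formula (21) translates directly into code. Note the paper's convention that exploitation happens with the large probability ε (the case p < ε), the opposite of the common convention where ε is a small exploration rate; the sketch below (names are ours) follows the paper's convention:

```python
import random


def epsilon_greedy(q_row, epsilon=0.9):
    """Select an action per formula (21).
    q_row: dict mapping each action in the action space U to its
    Q-value for the current state x; epsilon: exploitation
    probability (a large value, per the paper's convention)."""
    actions = list(q_row)
    if random.random() < epsilon:          # exploitation
        return max(actions, key=q_row.get)
    return random.choice(actions)          # exploration
```

With ε = 0.9, nine out of ten selections on average pick the best-known action, and one in ten probes a random action from U.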

#### 4.1.3. Procedure of the Proposed Algorithm

Based on the selected action vector, state vector, and cost function, the procedure of the proposed algorithm, which begins with the state vector x(k), is summarized as Procedure 1.

Procedure 1.


$$Q(x(k), a(k)) = R(x(k), a(k)) + \gamma \max_{a(k+1)} Q(x(k+1), a(k+1)) \tag{22}$$


$$\mathbf{a}^\*(\mathbf{k}) = \underset{\mathbf{a}(\mathbf{k})}{\operatorname{argmax}} \mathbf{Q}(\mathbf{x}(\mathbf{k}), \mathbf{a}(\mathbf{k})) \tag{23}$$


#### *4.2. Loop Level*

The technology of loop control is very mature, and a simple PID controller is adopted for each loop; the frameworks are shown in Figures 3 and 4. Here m<sub>com</sub> and m<sub>e</sub> are the reference variables that come from the process level via RL. The control variables u1 and u2 act on the system through the associated actuators, which are parts of the compression refrigeration system. The system outputs y1 and y2 are obtained by the sensors.

**Figure 3.** The framework of *mcom* loop.

**Figure 4.** The framework of *me* loop.
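The loop level described above is standard PID; a minimal discrete-time sketch (class name, gains, and sampling time are illustrative, not the test-bed's tuned values) could look like:

```python
class PID:
    """Discrete PID controller for one loop of the loop level."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, reference, measurement):
        """Return the control signal u driving the measurement
        toward the reference (e.g., m_com or m_e supplied by the
        process level)."""
        error = reference - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)
```

One such controller would serve the m<sub>com</sub> loop (output u1, the compressor frequency command) and another the m<sub>e</sub> loop (output u2, the valve opening command), mirroring Figures 3 and 4.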

#### **5. Case Studies**

Our test-bed of a compression refrigeration system is shown in Figure 5, and its structure is depicted in Figure 6 [36,37]. The liquid refrigerant stored in the tank goes into the heat interexchanger through the desiccator, where it gasifies by absorbing heat. The gas is driven back to the condenser by the compressor and becomes liquid again in the storage tank. The exchanged cool air goes into the compound air-conditioning plant to adjust the environment. The main devices and their specifications are listed in Table 2.



The temperature was obtained by Pt100 temperature transmitters with ranges of 0–30 °C, 0–60 °C, and 0–100 °C and an accuracy of ±0.1 °C (class A). The temperatures of the evaporator and the condenser were limited to the ranges −5–15 °C and 30–50 °C, respectively. The pressure was obtained by pressure transmitters with ranges of 0–2.5 MPa and 0–1.6 MPa and an accuracy of 0.25%. Electromagnetic flowmeters with an accuracy of 0.5% were added to measure m<sub>com</sub> and m<sub>e</sub>, the action variables of the process level. The traditional coefficient of performance (COP) is obtained in the stable state of the compression refrigeration system. However, the loads are often uncertain, which keeps the compression refrigeration system in a transient process. As a result, a 30 kW tunable stainless-steel electric heater was applied to simulate a thermal load ranging from 22.5 to 30 kW, and the sampling time was selected as 1 min. In the experiment, we adjusted the electric heater by changing its resistance through an adjustment button: rotating the button in one direction simulated the load climb, and randomly changing the direction simulated the load uncertainty.

**Figure 5.** The test-bed of a compression refrigeration system.

**Figure 6.** The structure of the compression refrigeration air-conditioning system.

#### *5.1. Effect of Model Accuracy on Performance*

A coarse model was built based on the principle of compression refrigeration in the process level with the following form:

$$\begin{cases} \dot{T}_e = c_0 m_{com} + c_1 l_e (T_w - T_e) \\ \dot{l}_e = c_2 (m_e - m_{com}) \\ \dot{T}_c = c_3 m_{com} + c_4 l_e (T_{wa} - T_c) \\ \dot{l}_c = c_5 (m_{com} - m_e) \\ \dot{T}_{wa} = c_6 (T_c - T_{wa}) + c_7 (T_{wa} - T_a) \end{cases} \tag{24}$$

where the variables Te, Tw, Tc, le, lc, Twa, Ta, me, and mcom have the same meaning as in Table 1. The coefficients c0–c7 were obtained from collected data by system identification [33]. The values are given in Table 3.

**Table 3.** The value of parameters in the coarse model [33].
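To make the structure of model (24) concrete, the following sketch advances it one sampling step with forward Euler integration. The coefficient vector, boundary temperatures Tw and Ta, and the initial state are placeholders, not the identified values of Table 3.

```python
def coarse_model_step(state, action, c, dt=60.0, Tw=12.0, Ta=25.0):
    """One forward-Euler step of the coarse model (24).
    state  = (Te, le, Tc, lc, Twa); action = (me, mcom);
    dt is the sampling time in seconds (1 min in the experiment).
    Tw and Ta (water/ambient temperatures) are assumed constants here."""
    Te, le, Tc, lc, Twa = state
    me, mcom = action
    dTe  = c[0] * mcom + c[1] * le * (Tw - Te)    # evaporation temperature
    dle  = c[2] * (me - mcom)                     # evaporator liquid level
    dTc  = c[3] * mcom + c[4] * le * (Twa - Tc)   # condensing temperature
    dlc  = c[5] * (mcom - me)                     # condenser liquid level
    dTwa = c[6] * (Tc - Twa) + c[7] * (Twa - Ta)  # cooling-water temperature
    return (Te + dt * dTe, le + dt * dle, Tc + dt * dTc,
            lc + dt * dlc, Twa + dt * dTwa)
```

Such a one-step predictor is all the RL layer needs to evaluate candidate actions without querying the physical plant.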


Taking the values from Table 3 as the baseline (called the first coarse model), we built the second coarse model by changing the coefficient c0 from −1672 to −1500, a change of 4.58%, and the coefficient c4 from 456 to 450. The control algorithm follows Procedure 1. The evolution of the states, the instantaneous efficiency, and the action variables are shown in Figures 7–9.

The refrigerant circulates between gas and liquid phases through endothermic and exothermic processes. The evaporation temperature is the external temperature of the refrigerant during evaporation in the evaporator, which is usually pre-set to a constant value according to the environmental requirement. It is seen from Figure 7 that the evaporation temperature Te (of both the first coarse model (blue curve) and the second coarse model) fluctuated around 7 °C. The condensing temperature is prone to be affected by the working pressure: more energy exchange produces higher pressure, which increases the condensing temperature. It is seen from Figure 7 that the condensing temperature Tc (of both coarse models) rose slowly to meet the load requirement from 22.5 to 30 kW.

Different from the conventional coefficient of performance (COP), which is defined in the steady state, our COP is extended to the whole process, covering both the transient and the steady state. The transient is often prevailing due to the resistance change. Therefore, according to our definition, the mean of the COP over a certain period of time is meaningful. The curve of instantaneous efficiency in Figure 8 shows the differences between the coarse models. There was an instantaneous error between the first coarse model (blue curve) and the second coarse model (green curve). However, the average efficiencies of the two coarse models over 200 min were 1.2681 and 1.2685, respectively, a difference of 0.03%.
We calculated the residual to evaluate the robustness of both coarse models. First, we made a polynomial fit by the least-squares method, using cubic curves for both models (no significant change in error above this order), according to the standard procedure [36], and obtained the estimate $\hat{COP}$:

$$\hat{COP}(k) = p\_1 k^3 + p\_2 k^2 + p\_3 k + p\_4 \tag{25}$$

where the coefficients are shown in Table 4.

**Figure 7.** The evolution of states Te and Tc.

**Figure 9.** The action variables.

**Table 4.** Coefficients of the polynomial fitting.


Therefore, the norm of residual (σ) is obtained according to the formula (26):

$$
\sigma = \sqrt{\frac{\sum\_{k=1}^{l} \left( \text{COP}(k) - \hat{COP}(k) \right)^2}{l - 1}} \tag{26}
$$

where *l* is the number of samples. The corresponding norms of residual are σ1 = 0.60575 and σ2 = 0.67583, respectively; their closeness means the coarse model has good robustness. As a result, we can evaluate the effect of model accuracy on the efficiency by comparing the COP results under the changed parameter values. A mean change of only 0.03% against a parameter change of 4.58% indicates that the difference between the coarse models has only a small impact.
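The fit of Equation (25) and the residual norm of Equation (26) can be computed directly with a standard least-squares polynomial fit; this sketch treats the sample index as the fitting variable, as in the equations above.

```python
import numpy as np

def residual_norm(cop):
    """Fit a cubic polynomial to the COP series by least squares
    (Equation (25)) and return the norm of residual sigma (Equation (26))."""
    cop = np.asarray(cop, dtype=float)
    k = np.arange(1, len(cop) + 1)       # sample index k = 1..l
    p = np.polyfit(k, cop, 3)            # coefficients p1..p4, highest degree first
    cop_hat = np.polyval(p, k)           # the estimate COP-hat(k)
    return float(np.sqrt(np.sum((cop - cop_hat) ** 2) / (len(cop) - 1)))
```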

Figure 9 gives the action variables me and mcom. Noticeable fluctuations can be seen in both, caused by the RL implementation and the load variations. We adopted the classical table mode to map the relation between the states and the actions. However, limited by our computer's capability (Intel(R) Core(TM)2 Duo CPU E7300 @ 2.66 GHz, 4.00 GB RAM), the actions were divided into 1000 pieces, which causes some error between the acquired actions and the ideal actions. These errors were somewhat enlarged in the process of evaluating the action value, because we took each episode to last 200 steps instead of T = ∞. Another important factor is the influence of the load variations: the final actions of the RL algorithm were determined by the initial states and the current loads, and the load variation, simulated by sequentially rotating the adjustment button, led to continuous transitions.
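The table-mode action selection described above can be sketched as an ε-greedy rule over a discretized action set. Following the paper's usage, ε is the probability of exploiting the best known action. The Q-table contents, state encoding, and action range below are illustrative assumptions.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.99, rng=random):
    """Table-mode action selection: with probability epsilon exploit the
    action with the highest stored value, otherwise explore at random.
    Q is a dict mapping (state, action) -> value; illustrative only."""
    if rng.random() < epsilon:
        return max(actions, key=lambda a: Q.get((state, a), 0.0))
    return rng.choice(actions)

# The action range is discretized into 1000 pieces, which introduces a
# quantization error between the acquired and the ideal actions.
# The numeric range (0.05-0.25) is a hypothetical flow interval.
actions = [0.05 + i * (0.25 - 0.05) / 999 for i in range(1000)]
```

The coarse quantization is exactly why the acquired action can miss the ideal one: the ideal value generally falls between two table entries.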

#### *5.2. Comparison with the Conventional Approach*

There is no existing method that handles the coefficient of performance (COP) under temperature changes caused by load variation. The predominant approach in HVAC is to keep the outlet temperature matching the loads by automatically adjusting the flow with an electronic expansion valve while maintaining a constant compressor speed. We compared the proposed method with this conventional approach. Figures 10–12 show the evolution of the states Te and Tc, the instantaneous efficiency, and the control variables. The blue curves and the red curves represent the results of the proposed approach and the conventional approach, respectively. It is seen from Figure 10 that Tc and Te follow a similar tendency under both approaches to meet the load requirement during the test period. Te with the proposed approach shows large fluctuations because the coarse model gives only a rough estimation of the states, whereas Te with the conventional approach is smooth because it is under classical control. Tc shows small fluctuations because its model accuracy is better.

**Figure 10.** The evolution of states Te and Tc.

**Figure 12.** The control variables.

There is a small fluctuation of the instantaneous efficiency for the conventional approach because it controls the temperature bias between inlet and outlet. Our proposed method releases the control of temperature within the limit range, so the instantaneous efficiency, formulated from Te and Tc, shows bigger fluctuations, sometimes low and sometimes high. We tried to reduce this fluctuation by adjusting the parameters. First, we changed the pre-set probability ε from 0.7, a common value for trial-and-error search, to 0.9. However, there was no improvement in either the fluctuation or the efficiency. Considering the real-time requirement, we did not decrease ε below 0.7, which would mean more than 30% of the steps being used for exploration; in fact, the new states can be computed from the coarse model, so a high exploration probability is unnecessary. Instead, we increased ε and obtained a good COP (1.34% better than the conventional method) even with ε set to 0.99, though the fluctuation did not improve. We also tried to adjust the learning rate α, which is a compromise between the speed and the accuracy of training: if it is too small, training takes more time; if it is too big, training loses accuracy. There is no theoretical basis for its choice so far in reinforcement learning, but an empirical value of 0.95–0.98 is common. The simulation shows that the fluctuations were still present; a representative case is shown in Figure 11, in which the instantaneous efficiency is hard to interpret on its own. However, since refrigeration is an energy conversion process, we are more concerned with the refrigeration efficiency over a period of time, so the average efficiency is proposed as the metric. The average efficiencies with the conventional approach (red curve) and with the proposed approach (blue curve) were 1.2513 and 1.2681, respectively. The proposed approach raises the average efficiency by 1.34% over the conventional approach during the 200 min period.

The control variables are shown in Figure 12. As pointed out in Section 4.2, the controller parameters of PID1 and PID2 should be pre-designed. The noticeable fluctuations in our proposed method cause difficulties in tuning the PID parameters. Fortunately, the refrigeration plant is a slow second-order system, and constant PID parameters have a wide feasible scope around each working point. The parameter values are prepared group by group in a table based on previous experiments, with system stability taken into account. The controller then directly calls the appropriate group of parameter values from this preparatory table to track the action references provided by the process level, which fluctuate due to the action partition error and the load variation.
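The group-by-group lookup described above amounts to simple gain scheduling: pick the pre-tuned gain set whose working point is closest to the current action reference. The table entries below are hypothetical placeholders, not the gains tuned on the test-bed.

```python
def select_pid_gains(table, reference):
    """Gain scheduling by nearest working point: return the (Kp, Ki, Kd)
    group whose table key is closest to the current action reference."""
    key = min(table, key=lambda working_point: abs(working_point - reference))
    return table[key]

# Hypothetical groups of (Kp, Ki, Kd) prepared around three working points,
# as the preparatory table would be filled from previous experiments.
pid_table = {
    0.08: (2.0, 0.10, 0.0),
    0.12: (1.8, 0.08, 0.0),
    0.16: (1.5, 0.05, 0.0),
}
```

Because the plant is slow and each gain set is valid over a wide neighborhood, nearest-point switching is sufficient and avoids online retuning while the references fluctuate.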

### **6. Conclusions**

In this paper, we have discussed an online data-driven approach to improve the conversion efficiency of a refrigeration system under load variation. A reinforcement learning approach, which is able to explore an unknown environment, is proposed to find the optimal actions based on online data at the process level; since the training process is complex and unknown in advance, its risk to the system is first examined by computer simulation. A coarse model is developed to evaluate the action value, which reduces the dependence of traditional control methods on model accuracy. Finally, the actions are applied as the pre-set variables by implementing single-loop control. The simulation shows that the proposed algorithm outperforms the conventional methods in the conversion efficiency of the refrigeration system in terms of the average, although larger fluctuations are noticeable. How to reduce this fluctuation is our further study.

**Author Contributions:** Conceptualization and methodology, D.Z. and Z.G. Writing—original draft preparation, D.Z. Writing—review and editing, Z.G.

**Funding:** This research was funded by the National Science Foundation of China under grant 61673074.

**Acknowledgments:** The authors would like to acknowledge the research support from the School of Electrical Engineering and Automation at Tianjin University, the National Science Foundation of China under grant 61673074; and the Faculty of Engineering and Environment at the University of Northumbria, Newcastle.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
