1. Introduction
Multi-physical field sensing and control is a technology now widely used in areas such as sensor networks, robotics, smart buildings, and instrumentation, to name a few [1,2,3]. With the rapid development of artificial intelligence, machine learning has become a promising strategy for executing measurement and control missions in versatile complex systems [1]. Thermal and energy management is a general issue commonly encountered in electromechanical equipment, energy devices, buildings, and so forth [4,5,6]. Owing to its simple operation and high efficiency, forced air cooling has become the most commonly used strategy for thermal control in areas such as plug-in hybrid electric vehicles [7], data centers [8], and handheld polymerase chain reaction (PCR) devices [9]. The fundamental principle of forced air cooling is to adjust the airflow so as to regulate heat dissipation and thermal distribution in a bounded space. In industry, the proportional-integral-derivative (PID) algorithm is a popular feedback control method that has been widely used in various thermal management systems; however, it is not well suited to multi-input multi-output (MIMO) control problems [10,11,12]. The MIMO temperature control problem is complex because of the strong coupling in the controlled object. Model-based design is a common decoupling approach. Li et al. [13] introduced a decoupling method for a double-level air flow dynamic vacuum system based on neural networks and the prediction principle. Gil et al. [14] presented a constrained nonlinear adaptive model-based control framework applied to a distributed solar collector field. Shen et al. [15] presented temperature uniformity control of large-scale vertical quench furnaces with a PID decoupling control system to eliminate the strong coupling effects of multiple heating zones. Although modeling is an effective way to deal with coupling effects in some cases, it cannot meet all the demands of practical applications. Specifically, the control performance of this approach usually depends on the accuracy of the developed model, which relies heavily on professional experience and strongly restricts the robustness of the system.
Miniaturization, modularization, and multi-functionalization have become major development trends in instrumentation [16,17,18,19]. As equipment becomes miniaturized and integrated, thermal control in a narrow space and energy consumption must be well managed, especially for devices integrating multiple functional modules with different thermal characteristics [4,5,19]. Generally, thermal control becomes more difficult as the working space shrinks, because of the strong thermal coupling [20]; it is essentially a multi-input multi-output problem.
Airflow generated by natural or forced convection is the heat-transfer carrier that modulates thermal behavior [21]. As shown in Figure 1a, thermal energy from a heat source is carried away by the airflow. The airflow rate can be increased by raising the fan's rotational speed, in which case more thermal energy is carried away and the temperature field drops. This suggests that airflow dominates the spatial thermal behavior, and that effective thermal and energy management relies on the regulation of both temperature and airflow. In this paper, we adopt micro bimodal sensors that can simultaneously detect spatial airflow and temperature to support fast, robust, self-adaptive thermal and energy management.
Machine-learning-based control has attracted great attention in recent years. Reinforcement learning has been one of the most remarkable approaches and has been successfully applied to various problems [22,23,24,25], such as games, multi-agent systems, swarm intelligence, statistics, and genetic algorithms. The main process of reinforcement learning is shown in Figure 1b. First, the agent obtains the state (s) and the reward (r) from the environment; the state represents the environment condition and the reward is a numerical signal. Second, the agent takes an action (a) on the environment, following a policy learned by the agent itself. The environment changes its state under the effect of the action and emits a reward that evaluates the action taken by the agent in the last state. The agent then adjusts its action policy in pursuit of a better reward. Through constant exploration and trial and error, the agent figures out a policy for taking actions in different states [26]. Reinforcement learning is an effective way to realize automatic control without human experience [27,28]; in essence, it is a self-learning method that interacts with the environment by trial and error and then self-adjusts the strategy sent to the actuator.
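The perceive-act-adjust loop described above can be sketched as follows; the toy environment, the hand-written policy, and the reward here are illustrative stand-ins, not the thermal system studied in this paper:

```python
class ToyEnvironment:
    """A 1-D stand-in environment: the state is a scalar the agent should drive to 0."""
    def __init__(self):
        self.state = 5.0

    def step(self, action):
        # The action changes the state; the reward scores the resulting state.
        self.state += action
        reward = -abs(self.state)
        return self.state, reward

def policy(state):
    """A fixed stand-in policy: act against the current state."""
    return -0.5 * state

env = ToyEnvironment()
s = env.state
for t in range(20):
    a = policy(s)        # the agent takes action a at state s
    s, r = env.step(a)   # the environment returns the next state and reward r
```

In actual reinforcement learning, the policy itself would be updated from the received rewards rather than being fixed in advance.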
In this paper, we propose a novel control method for thermal and energy management based on multi-physical field sensing and reinforcement learning, as shown in Figure 1c. The proposed methodology achieves fast, robust, self-adaptive temperature control as well as energy management by combining distributed bimodal airflow-temperature sensing with reinforcement learning. First, the distributed airflow-temperature sensors detect the airflow velocities and temperatures in the target space, which represent the environment states. Subsequently, reinforcement learning is introduced to evaluate the environment state and promptly execute control actions on the cooling fans.
The remainder of this paper is organized as follows. Section 2 introduces the fundamental principle of reinforcement learning. In Section 3, the reinforcement learning control method based on airflow-temperature field sensing is presented. In Section 4, the proposed method is applied to the thermal and energy management of a multi-module integrated system. The experimental results and discussions are presented in Section 5. The conclusion is drawn in Section 6.
2. Overview of Reinforcement Learning
As mentioned above, reinforcement learning is a methodology that pursues a better reward via trial-and-error exploration. The behavior of the agent is defined by a policy π, a probability distribution that maps states to actions, π: S → P(A), where S denotes the state space and A denotes the action space [24]. The transition dynamics and reward function can be written as p(s_{t+1} | s_t, a_t) and r(s_t, a_t), respectively. The expectation under policy π is denoted as E_π, and the reward from a state is defined as the sum of discounted future rewards [22]:

R_t = Σ_{i=t}^{T} γ^{i−t} r(s_i, a_i),   (1)

where γ is the discounting factor varying from 0 to 1. Different policies gain different R_t, and the goal of reinforcement learning is to learn a policy that maximizes the expectation E_π[R_t].
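The discounted return defined above can be computed directly from a reward sequence; this short helper illustrates the definition (for t = 0) and is not code from the paper:

```python
def discounted_return(rewards, gamma):
    """Sum of discounted future rewards: R_t = sum_i gamma**(i - t) * r_i,
    evaluated here for t = 0 over a finite reward sequence."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))
```

For rewards [1, 1, 1] and gamma = 0.5 this gives 1 + 0.5 + 0.25 = 1.75; a smaller gamma weights immediate rewards more heavily.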
The action-value function has been used in various reinforcement learning algorithms. It defines the expected reward after taking action a_t in state s_t under policy π:

Q^π(s_t, a_t) = E_π[R_t | s_t, a_t],

where Q^π(s_t, a_t) denotes the value of the state-action pair (s_t, a_t) following policy π [23]. The recursive Bellman equation can be used to calculate Q^π(s_t, a_t):

Q^π(s_t, a_t) = E[r(s_t, a_t) + γ E_{a_{t+1}∼π}[Q^π(s_{t+1}, a_{t+1})]].
Assuming a deterministic policy, which can be described as a function μ: S → A, the Bellman equation can be written as

Q^μ(s_t, a_t) = E[r(s_t, a_t) + γ Q^μ(s_{t+1}, μ(s_{t+1}))].

The expectation therefore depends only on the interaction between the agent and the environment.
Q(s_t, a_t) is approximated with parameters θ^Q, which can be optimized by minimizing the loss function [22,25,26]:

L(θ^Q) = E[(Q(s_t, a_t | θ^Q) − y_t)^2], with y_t = r(s_t, a_t) + γ Q(s_{t+1}, μ(s_{t+1}) | θ^Q).
We employ an actor-critic approach based on the deterministic policy gradient. It mainly contains a parameterized actor function μ(s_t | θ^μ), which maps states to actions, and a critic function Q(s_t, a_t | θ^Q), which describes the value of the state-action pair. The parameters of the critic function are updated via the Bellman equation, while the actor's parameters are updated by the chain rule [29]:

∇_{θ^μ} J ≈ E[∇_a Q(s, a | θ^Q)|_{a=μ(s)} ∇_{θ^μ} μ(s | θ^μ)].
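These two update rules can be illustrated with linear approximators standing in for neural networks; the feature choice, learning rates, and sample values below are assumptions made for the sketch, not the paper's settings:

```python
import numpy as np

def q_value(theta, s, a):
    """Linear critic Q(s, a | theta) with hand-picked features [s, a, s*a]."""
    return theta @ np.array([s, a, s * a])

def critic_update(theta, s, a, r, s_next, a_next, gamma, lr):
    """Move theta down the gradient of the squared Bellman error."""
    target = r + gamma * q_value(theta, s_next, a_next)    # the target y_t
    td_error = target - q_value(theta, s, a)
    return theta + lr * td_error * np.array([s, a, s * a])

def actor_update(w, theta, s, lr):
    """Deterministic policy gradient for a linear actor mu(s) = w * s:
    grad_w J = dQ/da |_{a=mu(s)} * dmu/dw, with dQ/da = theta[1] + theta[2]*s."""
    dq_da = theta[1] + theta[2] * s
    return w + lr * dq_da * s   # gradient ascent on the expected return
```

A single transition (s, a, r, s') moves the critic toward the Bellman target; the actor then follows the critic's action gradient through the chain rule.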
3. Reinforcement Learning Control Method Based on Airflow-Temperature Field Sensing
As mentioned above, reinforcement learning is an effective approach to automatically adjusting the control strategy solely by interacting with the environment, without human intervention. Multi-dimensional environment information can enhance the accuracy of state estimation. A characteristic of reinforcement learning is that it accumulates a large amount of exploration experience to predict future rewards and guide current actions.
The difficulties in realizing on-line reinforcement learning mainly lie in two aspects. First, it is difficult to accurately estimate future rewards over a short exploration period. Second, in some cases it is hard to receive a reward at every moment; for example, a score cannot be obtained until the end of a game. In real control, more attention is paid to whether the direction of action adjustment is correct than to the accuracy of the reward itself. Considering these characteristics of the control system, the theoretical methods of reinforcement learning can be appropriately simplified for practical application.
As shown in Equation (1), the reward from a state can be defined as the sum of discounted future rewards obtained from the environment. Setting the estimation depth to 1, which is sufficient to indicate the direction of action adjustment, gives

R_t = r(s_t, a_t),

where r(s_t, a_t) denotes the reward received at time t + 1 after taking action a_t at state s_t. Then

Q(s_t, a_t) = E[r(s_t, a_t)].
A key point in a reinforcement learning mission is the choice of the reward r(s_t, a_t). The reward function is related to the system performance, so the control objective needs to be converted into a corresponding reward function. In real control systems, there is often more than one control objective to be achieved, such as minimizing energy consumption while meeting the thermal control accuracy. Without loss of generality, such a multi-objective control requirement can be expressed as

f_i(x) ≤ D_i, i = 1, 2, …, n;  min g(x),

where f_i(x) is an objective function with a threshold requirement, D_i represents the corresponding constraint condition, and g(x) is an objective function pursuing an extreme value. Then r(s_t, a_t) can be expressed as

r(s_t, a_t) = Σ_{i=1}^{n} α_i max(f_i(x) − D_i, 0) + g(x).

The control objective is thus converted into minimizing the reward function r(s_t, a_t), where α_i is a scale factor. Since f_i(x) and g(x) are determined by the state of the environment, the definition of r(s_t, a_t) can be rewritten as

r(s_t, a_t) = r(s_{t+1}).

Based on an accurate prediction of the reward at time t + 1, the action strategy is adjusted automatically to drive the reward function toward its minimum.
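A reward of this form can be sketched as follows; the max(·, 0) penalty shape and the equal weighting of g(x) are assumptions consistent with the reward used later in Section 4:

```python
def reward(f_values, thresholds, g_value, alphas):
    """Combine threshold objectives f_i (penalized only beyond D_i) with an
    extremum objective g; the scale factors alpha_i weight each term."""
    penalty = sum(a * max(f - d, 0.0)
                  for a, f, d in zip(alphas, f_values, thresholds))
    return penalty + g_value
```

For example, reward([1.0, 3.0], [2.0, 2.0], 0.5, [1.0, 2.0]) penalizes only the second objective (3.0 exceeds its threshold 2.0) and adds g(x), giving 2.5.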
Two further critical issues are how to obtain the state information and how the reward affects the action strategy. The state information includes various types of information related to the controlled object, since changes in the controlled variables are related to multiple variables. The state information can be expressed as

s_t = (x_1/x_0, x_2/x_0, …, y_1/y_0, y_2/y_0, …),

where x_i and y_i denote different information types and x_0 and y_0 are their respective base values.
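Assembling such a normalized state vector is straightforward; the base values in the example are illustrative:

```python
def state_vector(xs, x0, ys, y0):
    """Normalize each reading by the base value of its information type."""
    return [x / x0 for x in xs] + [y / y0 for y in ys]
```

In this paper's setting, for instance, airflow readings would be normalized by a reference velocity and temperatures by a reference temperature.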
The mechanism by which the reward affects the action strategy determines the actor-critic approach based on the deterministic policy gradient and the selection of the actor function μ(s_t | θ^μ) and critic function Q(s_t, a_t | θ^Q). Two neural networks are proposed: a policy network and a value network. The schematic diagram is shown in Figure 2.

The policy network forms the behavior strategy: it acquires the state information s_t of the controlled object and exports control signals a_t to the actuators. The value network evaluates the behavior strategy: its inputs are the state s_t and action a_t, and its output is the critic function value Q(s_t, a_t | θ^Q). The value network updates its parameters by minimizing the deviation between its output and the received reward, while the policy network updates its parameters by gradient descent to reduce the value network's output. Therefore, through continuous interaction with the controlled object and the learning of the value and policy networks, r(s_t, a_t) gradually decreases.
The above method is summarized in the flowchart shown in Figure 3. A random variable with zero mean is added to the output of the policy network to form the actual action, and its variance gradually decreases over time.
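The decaying exploration noise can be sketched as follows; the initial standard deviation and decay rate are illustrative assumptions:

```python
import random

def exploration_action(policy_output, step, sigma0=0.5, decay=0.99):
    """Actual action = policy output + zero-mean Gaussian noise whose
    standard deviation shrinks geometrically with the time step."""
    sigma = sigma0 * decay ** step
    return policy_output + random.gauss(0.0, sigma)
```

Early steps explore widely around the policy output; as learning proceeds, the executed action converges to the learned policy.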
4. Application of On-Line Reinforcement Learning Method
As mentioned above, effective thermal control and energy management rely on the regulation of both temperature and airflow. We developed micro bimodal sensors that simultaneously detect airflow velocity and temperature. The bimodal sensor comprises a micromachined hot-film anemometer and a thermistor. The airflow sensing relies on the convective heat transfer from the electrically heated hot-film to the surrounding air: when the hot-film is heated above the ambient temperature, the airflow-dependent heat transfer governs its resistance through the thermoelectricity of the hot-film [30,31,32]. Therefore, the hot-film serves as an airflow detector. The temperature sensing is based on the thermoelectric conversion of the thermistor.

The circuit schematic diagram and the developed prototype of the bimodal sensor are shown in Figure 4a, where a hot-film resistor (hot-film), a temperature sensor, a compensating resistor, and two balance resistors form a Wheatstone bridge. The hot-film resistor is used to detect airflow. The temperature resistor detects the ambient temperature and also provides temperature compensation for the anemometer. The hot-film resistor is fabricated from Pt. The bimodal sensor is operated in the constant temperature difference (CTD) feedback circuit shown in Figure 4a, which keeps the temperature difference between the heated hot-film resistor and the ambient constant [30,31]. The compensating resistor R_c is used to adjust the heating temperature of the hot-film resistor R_h.
The characterization of the airflow sensor was conducted in a wind tunnel experiment. The airflow rate was controlled by a mass flow controller (Fluke molbloc-L, Fluke Calibration, Everett, WA, USA). The relationship between the airflow velocity V and the output voltage U of the sensor was formulated as U^2 = a + bV^n [30,31], where a, b, and n are constants determined through least-squares estimation. Figure 4b shows the output voltage U against the airflow velocity.
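The least-squares calibration can be sketched as below: for each candidate exponent n we solve a linear least-squares problem for a and b and keep the best fit. The grid range and the synthetic calibration data are illustrative, not the paper's measurements:

```python
import numpy as np

def fit_kings_law(V, U, n_grid=np.linspace(0.2, 1.0, 81)):
    """Fit U^2 = a + b * V**n by searching n on a grid and solving
    ordinary least squares for (a, b) at each candidate n."""
    best = None
    for n in n_grid:
        X = np.column_stack([np.ones_like(V), V ** n])   # columns for a and b
        coef, *_ = np.linalg.lstsq(X, U ** 2, rcond=None)
        rss = float(np.sum((X @ coef - U ** 2) ** 2))    # residual sum of squares
        if best is None or rss < best[0]:
            best = (rss, coef[0], coef[1], n)
    _, a, b, n = best
    return a, b, n
```

Because the model is linear in a and b once n is fixed, the grid search avoids a full nonlinear optimization while still recovering all three constants.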
The detected temperature can be deduced from the sensor outputs and calculated by

T = (R_t/R_0 − 1)/α_t,

where R_0 is the resistance of R_t at 0 °C and α_t is the temperature coefficient of R_t.

Characterization of the temperature sensor was conducted by placing the sensor in a temperature-controlled oven (Thermo Scientific OGH60). The comparison between the temperature detected by the airflow-temperature sensor and the actual temperature is shown in Figure 5a,b. The measurement error is less than 0.5 °C.
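Inverting the linear thermistor characteristic R_t = R_0(1 + α_t·T) gives the detected temperature; the resistance and coefficient values in the example are illustrative, not the sensor's calibrated constants:

```python
def temperature_from_resistance(Rt, R0, alpha_t):
    """Recover temperature (deg C) from the thermistor resistance,
    assuming the linear model Rt = R0 * (1 + alpha_t * T)."""
    return (Rt / R0 - 1.0) / alpha_t
```

With an assumed R0 = 1000 Ω and α_t = 0.0039 /°C, a reading of 1097.5 Ω corresponds to 25 °C.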
The schematic diagram of the on-line reinforcement learning control method for thermal and energy management is shown in Figure 6, where multiple airflow-temperature sensors are distributed to detect the airflow-temperature fields as the environment state of the control system.

Using the neural network approach described in Section 3, the value network evaluates the state-action pair: it maps the environment state and the action to a reward. The value network receives a reward from the outputs of the controlled object and updates its parameters to optimize the evaluation. The policy network exports the control commands that drive the cooling fans according to the airflow and temperature information, and its parameters are adjusted on the basis of the value network's evaluation. The reward function is selected by considering both the accuracy of temperature control and the power consumption of the fans, and is formulated as

r(s_t, a_t) = α_0 P(t + 1)/P_0 + Σ_i α_i max(|T_i(t + 1) − R_i(t + 1)| − D_i, 0)/T_0,

where P(t + 1) denotes the power consumption of the fans at time t + 1, T_i(t + 1) and R_i(t + 1) denote the sampled and target temperature values of sensor i at time t + 1, respectively, D_i represents the required temperature control precision, α_0 and α_i are the factors that regulate the ratio of each control target, and P_0 and T_0 are the base values of power consumption and temperature, respectively.
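The reward combining fan power and temperature-tracking error can be sketched as follows; the dead-band max(·, 0) form and the sample numbers are assumptions consistent with the description above:

```python
def thermal_reward(P, T, R, D, alphas, alpha_p, P0, T0):
    """Reward at time t+1: weighted fan power plus temperature deviations,
    each penalized only beyond its precision requirement D_i."""
    tracking = sum(a * max(abs(t - r) - d, 0.0) / T0
                   for a, t, r, d in zip(alphas, T, R, D))
    return alpha_p * P / P0 + tracking
```

Minimizing this reward trades off control precision against energy consumption through the α factors: within the dead-band D_i only the power term remains, so the fans slow down.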