**3. Deep Reinforcement Learning Algorithm**

Model-free reinforcement learning is a technique for understanding and automating goal-directed learning and decision-making [33]. It differs from most other control algorithms in that it emphasizes agents learning through direct interaction with the environment, without relying on model supervision or a complete environmental model [34]. As an interactive learning method, the main features of reinforcement learning are trial-and-error search and delayed reward [35]. Figure 1 shows the interaction process between the agent and the environment. At each time step, the agent observes the environment to obtain the state *s<sub>t</sub>* and then performs the action *a<sub>t</sub>*. The environment then generates the next state *s<sub>t+1</sub>* and the reward *r<sub>t</sub>* according to *a<sub>t</sub>*. The probability that the process moves into the new state *s<sub>t+1</sub>* is influenced by the chosen action and is given by the state transition function. Such a process can be described by a Markov decision process (MDP) [36,37]. The goal of reinforcement learning is to formulate the problem as an MDP and find the optimal strategy [38]. The so-called strategy refers to the state-to-action mapping, commonly denoted by the policy π: for a given state *s*, it defines a distribution over the action set, that is, π(*a*|*s*) = *p*[*A<sub>t</sub>* = *a* | *S<sub>t</sub>* = *s*]. Reinforcement learning introduces the return to represent the cumulative discounted reward from a certain time *t*, as follows:

$$G_{t} = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \tag{1}$$

where *r* represents an immediate reward and γ represents a discount factor that indicates how important future returns are relative to current returns. The action–value function used in reinforcement learning algorithms describes the expected return after taking an action in state *s<sub>t</sub>* and thereafter following policy π:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{r_{i \ge t}, s_{i > t} \sim E,\, a_{i > t} \sim \pi} \left[ R_t \mid s_t, a_t \right] \tag{2}$$

Reinforcement learning makes use of the recursive relationship known as the Bellman equation:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E} \left[ r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi} \left[ Q^{\pi}(s_{t+1}, a_{t+1}) \right] \right] \tag{3}$$

The expectation depends only on the environment, which means that it is possible to learn *Q*<sup>μ</sup> off-policy using transitions generated from a different stochastic behavior policy β. Q-learning, a commonly used off-policy algorithm, uses the greedy policy μ(*s*) = argmax<sub>*a*</sub> *Q*(*s*, *a*). We consider function approximators parameterized by θ<sup>*Q*</sup>, which we optimize by minimizing the loss, as follows:

$$L\left(\theta^{Q}\right) = \mathbb{E}_{s_{t}\sim\rho^{\beta},\, a_{t}\sim\beta,\, r_{t}\sim E} \left[ \left( Q\left(s_{t}, a_{t}|\theta^{Q}\right) - y_{t} \right)^{2} \right] \tag{4}$$

where

$$y_t = r(s_t, a_t) + \gamma\, Q(s_{t+1}, \mu(s_{t+1})|\theta^Q)$$
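In the tabular case, minimizing this loss reduces to the classic Q-learning update toward the greedy target; a minimal illustrative sketch in Python (the state/action indices and values below are invented for the example):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    td_target = r + gamma * np.max(Q[s_next])   # greedy policy mu(s) = argmax_a Q(s,a)
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = np.zeros((3, 2))            # 3 states, 2 actions
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2, alpha=0.5, gamma=0.9)
# Q[0, 1] = 0.5 * (1.0 + 0.9 * 0 - 0) = 0.5
```

The function approximator case of Equation (4) replaces the table lookup with a parameterized network and a gradient step on the squared TD error.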

Recently, the deep Q-network (DQN) adapted the Q-learning algorithm in order to make effective use of large neural networks as function approximators. Before DQN, it was generally deemed difficult and unstable to use large nonlinear function approximators for learning value functions. Thanks to two innovations, DQN can use a function approximator to learn the value function in a stable and robust manner: (a) the network is trained off-policy with samples from a replay buffer to minimize correlations between samples, and (b) a target Q-network is used to provide consistent targets during temporal difference (TD) backups.
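The replay buffer of innovation (a) is simply a bounded store of transitions sampled uniformly at random; a minimal sketch (the capacity and transition values here are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Bounded FIFO store of (s, a, r, s_next) transitions with uniform sampling."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # uniform random sampling breaks the temporal correlation
        # between consecutive transitions
        return random.sample(list(self.buffer), batch_size)

buf = ReplayBuffer(capacity=2)
buf.store(0, 0, 1.0, 1)
buf.store(1, 1, 0.5, 2)
buf.store(2, 0, 0.2, 0)        # capacity 2: the first transition is evicted
batch = buf.sample(2)
```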

The deterministic policy gradient (DPG) algorithm maintains a parameterized actor function, μ(*s*|θ<sup>μ</sup>), which specifies the current policy by deterministically mapping states to a specific action. The critic *Q*(*s*, *a*) is learned using the Bellman equation, as in Q-learning. The actor is updated by applying the chain rule to the expected return from the start distribution *J* with respect to the actor parameters, as follows:

$$\begin{aligned} \nabla_{\theta^{\mu}} J &\approx \mathbb{E}_{s_{t} \sim \rho^{\beta}} \left[ \nabla_{\theta^{\mu}} Q(s, a|\theta^{Q})\big|_{s = s_{t},\, a = \mu(s_{t}|\theta^{\mu})} \right] \\ &= \mathbb{E}_{s_{t} \sim \rho^{\beta}} \left[ \nabla_{a} Q(s, a|\theta^{Q})\big|_{s = s_{t},\, a = \mu(s_{t})} \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})\big|_{s = s_{t}} \right] \end{aligned} \tag{5}$$
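The chain rule in Equation (5) can be verified numerically on a toy one-dimensional case; the quadratic critic and linear policy below are invented purely for illustration:

```python
# Toy check of the deterministic policy gradient chain rule:
#   Q(s, a) = -(a - s)^2,  mu(s) = theta * s,  J(theta) = Q(s, mu(s))
#   dJ/dtheta = dQ/da |_{a=mu(s)} * dmu/dtheta = -2*(theta*s - s) * s
def dq_da(s, a):
    """Gradient of the toy critic with respect to the action."""
    return -2.0 * (a - s)

def policy_grad(theta, s):
    """Chain-rule gradient of Equation (5) for the toy case."""
    a = theta * s
    return dq_da(s, a) * s

def finite_difference(theta, s, eps=1e-6):
    """Central finite-difference estimate of dJ/dtheta for comparison."""
    q = lambda th: -((th * s - s) ** 2)
    return (q(theta + eps) - q(theta - eps)) / (2 * eps)

s, theta = 2.0, 0.5
print(policy_grad(theta, s), finite_difference(theta, s))  # both approximately 4.0
```

The two estimates agree, which is exactly the identity the actor update exploits: the critic supplies the action gradient, and the actor supplies the policy Jacobian.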

Deep deterministic policy gradient (DDPG) combines the actor–critic (AC) approach based on the deterministic policy gradient (DPG) [31] with insights from the recent success of the deep Q-network (DQN). Although DQN has achieved great success on high-dimensional problems, such as Atari games, the action space for which the algorithm is implemented is still discrete. However, for many tasks of interest, especially physical industrial control tasks, the action space must be continuous. Note that if the action space is discretized too finely, the control problem leads to an excessively large action space, which is extremely difficult for the algorithm to learn. The DDPG strategy uses deep neural networks as approximators to effectively combine deep learning with the deterministic policy gradient algorithm. It can cope with high-dimensional inputs, achieve end-to-end control, and output continuous actions, and thus can be applied to more complex situations with large state spaces and continuous action spaces. In detail, DDPG uses an actor network to tune the parameters θ<sup>μ</sup> of the policy function, that is, to decide the optimal action for a given state. A critic network is used to evaluate the policy function estimated by the actor according to the temporal difference (TD) error (see Figure 9). One issue for DDPG is that it rarely explores actions. A solution is to add noise to the parameter space or the action space; it has been claimed that adding noise to the parameter space is better than adding it to the action space [39]. One commonly used noise source is the Ornstein–Uhlenbeck random process. Algorithm 1 shows the pseudo-code of the DDPG algorithm.
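The Ornstein–Uhlenbeck process mentioned above produces temporally correlated, mean-reverting noise, which suits exploration with inertial actuators such as the VGT vane; a minimal discretized sketch (the parameter values are common defaults, not necessarily those used in this study):

```python
import numpy as np

class OUNoise:
    """Discretized Ornstein-Uhlenbeck process:
    dx = theta * (mu - x) * dt + sigma * sqrt(dt) * dW."""
    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.x = mu

    def sample(self):
        self.x += (self.theta * (self.mu - self.x) * self.dt
                   + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal())
        return self.x

noise = OUNoise()
samples = [noise.sample() for _ in range(1000)]
# successive samples are correlated, and the process drifts back toward mu = 0
```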

**Figure 9.** Actor–critic architecture.

**Algorithm 1:** Pseudo-code of the deep deterministic policy gradient (DDPG) algorithm.

Randomly initialize critic network *Q*(*s*, *a*|θ<sup>*Q*</sup>) and actor μ(*s*|θ<sup>μ</sup>) with weights θ<sup>*Q*</sup> and θ<sup>μ</sup>.
Initialize target networks *Q*′ and μ′ with weights θ<sup>*Q*′</sup> ← θ<sup>*Q*</sup>, θ<sup>μ′</sup> ← θ<sup>μ</sup>.
Initialize replay buffer *R*.
**for** episode = 1, M **do**
Initialize a random process *N* for action exploration.
Receive initial observation state *s*<sub>1</sub>.
**for** t = 1, T **do**
Select action *a<sub>t</sub>* = μ(*s<sub>t</sub>*|θ<sup>μ</sup>) + *N<sub>t</sub>* according to the current policy and exploration noise.
Execute action *a<sub>t</sub>*; observe reward *r<sub>t</sub>* and new state *s*<sub>t+1</sub>.
Store transition (*s<sub>t</sub>*, *a<sub>t</sub>*, *r<sub>t</sub>*, *s*<sub>t+1</sub>) in *R*.
Sample a random minibatch of *N* transitions (*s<sub>i</sub>*, *a<sub>i</sub>*, *r<sub>i</sub>*, *s*<sub>i+1</sub>) from *R*.
Set *y<sub>i</sub>* = *r<sub>i</sub>* + γ*Q*′(*s*<sub>i+1</sub>, μ′(*s*<sub>i+1</sub>|θ<sup>μ′</sup>)|θ<sup>*Q*′</sup>).
Update the critic by minimizing the loss:

$$L = \frac{1}{N}\sum_i \left( y_i - Q(s_i, a_i|\theta^Q) \right)^2$$

Update the actor policy using the sampled policy gradient:

$$\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_i \nabla_a Q(s, a|\theta^Q)\big|_{s=s_i,\, a=\mu(s_i)} \nabla_{\theta^\mu}\mu(s|\theta^\mu)\big|_{s=s_i}$$

Update the target networks:

$$\theta^{Q'} \leftarrow \tau \theta^Q + (1 - \tau) \theta^{Q'}$$

$$\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau) \theta^{\mu'}$$

**end for**
**end for**
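The soft target update in the final step of Algorithm 1 is a Polyak average of the online weights; a minimal sketch with NumPy arrays standing in for the network parameters:

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta' for each parameter array, in place."""
    for tp, op in zip(target_params, online_params):
        tp *= (1.0 - tau)
        tp += tau * op

theta = [np.ones(3)]            # online network weights
theta_target = [np.zeros(3)]    # target network weights
soft_update(theta_target, theta, tau=0.1)
# theta_target is now 0.1 * 1 + 0.9 * 0 = [0.1, 0.1, 0.1]
```

Because τ is small, the target networks change slowly, which is what stabilizes the TD targets during training.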

TensorFlow is one of the most widely used end-to-end open-source platforms for machine learning. In order to draw on DRL research findings from other fields, and especially to re-use existing machine learning code frameworks, we used Python with TensorFlow as the algorithm design language in this study. Meanwhile, in order to apply the DRL algorithm built in Python to the diesel engine environment, we proposed to use MATLAB/Simulink as the program interface, so that two-way transmission among Python, MATLAB/Simulink, and GT-SUITE could be achieved. The specific DDPG algorithm implementation and the corresponding co-simulation platform are shown in Figure 10. Key concepts of the DDPG-based boost control algorithm are formulated as follows.

**Figure 10.** Deep deterministic policy gradient (DDPG) algorithm implementation and the corresponding co-simulation platform.

The engine speed, the actual boost pressure, the target boost pressure, and the current vane position were chosen to form a four-dimensional state space. It should be noted that only a small number of states were chosen in this study in order to (a) facilitate the training process and (b) showcase the generalization ability of the DRL techniques. The vane position, controlled by a membrane vacuum actuator, was selected as the control action. The immediate reward is important in an RL algorithm because it directly affects the convergence curves and, in some cases, a fine adjustment of the immediate reward parameters can drive the final policy to opposite poles. The agent always tries to maximize the reward it can obtain by taking the optimal action at each time step, because a larger cumulative reward represents better overall control behavior. Therefore, the immediate reward should be defined based on the optimization goals. The control objective of this work was to track the target boost pressure under transient driving cycles by regulating the vanes in a quick and stable manner. With this objective in mind, a function of the error between the target and the current boost pressure and of the rate of action change was defined as the immediate reward. The equation for the immediate reward is given as follows:

$$r_t = e^{-\frac{\left\| 0.95 \cdot e(t) + 0.05 \cdot I_t \right\|^2}{2}} - 1 \tag{6}$$

where *r<sub>t</sub>* is the immediate reward generated when the state changes by taking an action at time *t*, and *e*(*t*) and *I<sub>t</sub>* represent the error between the target and the current boost pressure and the rate of action change, respectively.
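Equation (6) can be sketched directly in Python; the weights 0.95 and 0.05 follow the equation, while the input values below are purely illustrative:

```python
import numpy as np

def immediate_reward(error, action_rate):
    """r_t = exp(-||0.95*e(t) + 0.05*I_t||^2 / 2) - 1, bounded in (-1, 0]."""
    x = 0.95 * error + 0.05 * action_rate
    return float(np.exp(-0.5 * x ** 2) - 1.0)

print(immediate_reward(0.0, 0.0))   # perfect tracking, no actuator movement -> 0.0
print(immediate_reward(2.0, 0.0))   # large tracking error -> close to -1
```

The reward is maximal (zero) only when both the tracking error and the action-change rate vanish, so the agent is pushed toward accurate tracking with smooth actuation.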

The corresponding DDPG parameters and an illustration of the actor–critic network are shown in Table 1 and Figure 11. In this study, the input layer of the actor network has four neurons, namely, the engine speed, the actual boost pressure, the target boost pressure, and the current vane position. There are three hidden layers, each having 120 neurons. The output layer has one neuron representing the control action (i.e., the vane position). All these layers are fully connected. The input layer of the critic network has one additional neuron compared to that of the actor network, namely, the control action produced by the actor network. There is one hidden layer having 120 neurons. The output layer of the critic network has one neuron representing the value function of the selected action for the specific state. The network was trained for 50 episodes, and each episode represents the first 80% of the FTP-72 trip (1098 s).
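A minimal TensorFlow/Keras sketch of the layer sizes described above (the activation functions are assumptions, as the text does not specify them):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_actor():
    # state: engine speed, actual boost, target boost, current vane position
    state = layers.Input(shape=(4,))
    x = state
    for _ in range(3):                               # three hidden layers of 120 neurons
        x = layers.Dense(120, activation="relu")(x)
    action = layers.Dense(1, activation="tanh")(x)   # vane position command
    return tf.keras.Model(state, action)

def build_critic():
    state = layers.Input(shape=(4,))
    action = layers.Input(shape=(1,))                # one extra input: the actor's action
    x = layers.Concatenate()([state, action])        # five inputs in total
    x = layers.Dense(120, activation="relu")(x)      # one hidden layer of 120 neurons
    q = layers.Dense(1)(x)                           # Q-value of the (state, action) pair
    return tf.keras.Model([state, action], q)

actor, critic = build_actor(), build_critic()
```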

**Table 1.** DDPG parameters.

**Figure 11.** Illustration of the (**a**) actor network and (**b**) critic network.

#### **4. Results and Discussion**

In this article, the simulations were conducted on an advanced co-simulation platform (see Figure 10). In order to validate its performance, the proposed DDPG algorithm was compared to a fine-tuned gain-scheduled PID controller with both its P and I gains mapped as a function of speed and requested load. Without derivative action, a PI-controlled system may be less responsive, but it is steadier under steady-state conditions (and thus often adopted in industrial practice). It should be noted that this PID controller was manually tuned using the classic Ziegler–Nichols methods [40], which took considerable effort, and should therefore be interpreted as a good control behavior benchmark. The US FTP-72 (Federal Test Procedure) driving cycle shown in Figure 7 was employed to verify the proposed strategy. The cycle simulates an urban route of 12.07 km with frequent stops and rapid accelerations; the maximum speed is 91.25 km/h and the average speed is 31.5 km/h. This transient driving cycle was selected because it mimics the real-world VGT environment, a system with large lag, strong coupling (especially with EGR), and nonlinear characteristics; thus, if a well-behaved control strategy is established in this environment, it should perform well in driving cycles with more steady-state regions (such as the New European Driving Cycle (NEDC)). In this study, the first 80% of the driving cycle time was used to train the DDPG algorithm, and the remaining data were reserved for testing. There are many different measures that can be used to compare the quality of a controlled response. Integral absolute error (IAE), which integrates the absolute error over time, was used in this study to compare the control performance of the PID controller and the proposed DDPG algorithm.
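The IAE metric is simply the time integral of the absolute tracking error; a minimal sketch using a rectangle rule (the signal values here are illustrative):

```python
import numpy as np

def iae(target, actual, dt=1.0):
    """Integral absolute error: integral of |target - actual| dt, rectangle rule."""
    err = np.abs(np.asarray(target, dtype=float) - np.asarray(actual, dtype=float))
    return float(np.sum(err) * dt)

target = [1.0, 1.0, 1.0, 1.0]
actual = [0.8, 1.1, 1.0, 0.9]
print(iae(target, actual, dt=1.0))  # 0.2 + 0.1 + 0.0 + 0.1, approximately 0.4
```

A lower IAE over the same cycle therefore means tighter boost pressure tracking overall.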

Figure 12 shows the control performance of the fine-tuned PID controller. At first glance, the actual boost pressure tracks the target boost pressure well. However, after zooming in on some operating conditions, a relatively large error can still be seen. In some cases this is due to turbo-lag, which cannot be improved from the control point of view (for example, in the period from 10 s to 40 s, although the VGT is already controlled to its minimum flow area for the fastest transient response, the engine still exhibits a lack of boost). Nevertheless, in most situations, taking the period from 920 s to 945 s as an example, there is still room for the PID control strategy to improve. We note that the results in Figure 12 represent only a balance between control performance and tuning effort; that is, better control behavior could be achieved if the tuning were performed more finely, but more effort and resources would be required. In this research, the emphasis was not put on the final control performance comparison between PID and DRL theory, because the structure behind each method is different and the control behavior, to a large extent, depends on how the control parameters are tuned. More focus was put on solving the control problem in a self-learning manner and on showcasing the good control adaptivity of the DRL approach.

The learning process of the DDPG algorithm can be seen in Figure 13. At the beginning of learning, the cumulative reward of the DDPG agent per episode was extremely low because (1) the agent (corresponding to the vane position actuator in the VGT boosting controller) only randomly selects actions in order to complete an extensive search process and avoid falling into a local optimum, and (2) the agent has no prior experience of what it should do in a specific state (thus it can only select actions based on the initial DDPG parameters). After approximately 40 episodes, the algorithm had converged, with the cumulative reward reaching a high value. This indicates that the agent had learned to control the boost pressure. It should be noted that the learning process takes place only through direct interaction with the environment (in this case, the simulation software serves as the environment), without relying on model supervision or complete environment models, and a well-behaved control strategy is developed autonomously from scratch. To answer the question of whether the learned controller was good enough, the control performance over the first 80% of the FTP-72 driving cycle using the final DDPG controller was compared with that of the aforementioned fine-tuned PID controller. In Figures 12 and 14, it can be seen that both the PID and the proposed DDPG algorithm perform well at first glance, but after zooming in on some operating conditions, a tracking disparity between the two can still be seen. Although the PID controller tracks the boost pressure with relatively small errors, the proposed DDPG algorithm outperforms it with almost excellent tracking behavior. The IAE value of the PID control performance is 41.72, whereas the value for the proposed DDPG algorithm is as low as 37.43.

**Figure 12.** The fine-tuned gain-scheduled PID control performance.

**Figure 13.** Track of all rewards per episode.

**Figure 14.** Control performance for the first 80% FTP-72 driving cycle using the final DDPG training parameters.

This difference is shown better in Figure 15, where the control performance comparison between the fine-tuned PID and the proposed DDPG from the time period of 920 s to 945 s is made. This time period was selected because it corresponds to the frequently used engine operating regions.

**Figure 15.** Control performance comparison between the fine-tuned PID and the proposed DDPG from the time period of 920 s to 945 s.

In order to showcase the generalization ability of the proposed DRL techniques, the control performance over the end 20% of the FTP-72 driving cycle based on the trained DDPG parameters was simulated. It can be seen in Figure 16 that although the control parameters were not trained on this part of the driving cycle (i.e., some of the states may not have been visited in the previous training process), the controller still exhibits good control behavior. Compared to the fine-tuned PID controller over the same time period (already optimized for this period), the proposed DDPG clearly performs better; the IAE values of the PID and the proposed DDPG are 10.17 and 8.35, respectively.

**Figure 16.** Control performance for the end 20% FTP-72 driving cycle based on the trained DDPG algorithm.

As the control strategy based on the proposed DDPG algorithm is able to match (or improve upon, depending on the tuning effort of each algorithm) the control performance of a fine-tuned PID benchmark controller, it could replace the traditional PID controller for boost control in the near future. Compared to the benchmark PID controller, whose parameters traditionally require manual adjustment (and thus have low tuning efficiency), the control strategy based on DDPG is able to adaptively adjust its policy during the learning process, which not only saves considerable manpower, but also adapts better to a changing environment and hardware aging over time (thus being unbiased by modelling errors). To prove this, another simulation model with different combustion and turbocharger models was used. This was a simplified replication of a real engine plant whose transient performance could diverge from the simulation prediction, mainly due to combustion and turbocharger modelling inaccuracy. Figure 17 shows the control performance using both the pre-trained algorithm (which indicates the off-line behavior) and the strategy after continued on-line learning in the "real engine" simulation model. It can be seen that the off-line policy is able to achieve relatively good control behavior and can be improved further by continuing to learn from interaction with the new environment on-line. Thus, unlike other studies in which the control parameters optimized on a simulation platform are, in most cases, no longer valid in the experimental test, the control strategy based on the proposed DRL techniques can combine simulation training and continued experimental training, in order to fully utilize off-line computational resources and refine the algorithm on-line in the experimental environment.

**Figure 17.** Control performance using the pre-trained off-line algorithm and the strategy after continuing on-line learning in the "real engine" simulation model.

Furthermore, because the learning process of the proposed DDPG algorithm distinguishes itself from other approaches by putting its emphasis on interacting with the transient environment, its final control performance is able to outperform that of approaches based only on steady-state simulation or experimental control behavior. The most obvious example is its capability to exceed classic feedforward control, which only responds to its control signal in a pre-defined way, without responding to how the load reacts. In industry, most pre-defined maps in controllers with a feedforward function are calibrated in a steady-state environment and are fixed for the entire product lifecycle. The proposed DDPG algorithm, however, because its control action adapts to the environment, is equivalent to the concept of so-called automatic transient calibration.

#### **5. Conclusions and Future Work**

In this paper, a model-free self-learning DDPG algorithm was employed for the boost control of a VGT-equipped engine, with the aim of comparing DDPG techniques with traditional control methods and providing references for the further development of DDPG algorithms on sequential decision-control problems in other industrial applications. Using a fine-tuned PID controller as a benchmark, the results show that the proposed DDPG algorithm can achieve good transient control performance from scratch by autonomously learning from interaction with the environment, without relying on model supervision or complete environment models. In addition, the proposed strategy is able to adapt to a changing environment and hardware aging over time by adaptively tuning the algorithm in a self-learning manner on-line, making it attractive for real plant control problems whose system consistency may not be strictly guaranteed and whose environment may change over time. This indicates that the control strategy based on the proposed DRL techniques can combine simulation training and continued experimental training, in order to fully utilize off-line computational resources and refine the algorithm on-line in the experimental environment. Future work may include applying a DRL-based parallel computing architecture to boost the computational efficiency for control problems with high-order, large-lag, strongly coupled, and nonlinear characteristics. Another interesting direction could be combining look-ahead strategies with the proposed DRL techniques to accelerate the training process and improve the final control performance. The stability of control among multiple reinforcement learning-based controllers will also be studied, including distributed RL and hierarchical RL.

**Author Contributions:** B.H. drafted the paper and provided the overall research ideas. J.Y. provided reinforcement learning strategy suggestions. S.L. and J.L. analyzed the data. H.B. revised the paper. These authors contributed equally to this work.

**Funding:** This work was supported by the National Natural Science Foundation of China (Grant No. 51905061), the Chongqing Natural Science Foundation (Grant No. cstc2019jcyj-msxmX0097), the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJQN201801124), the Venture & Innovation Support Program for Chongqing Overseas Returners (Grant No. cx2018135), and the Open Project Program of the State Key Laboratory of Engines (Tianjin University) (Grant No. k2019-02), to whom the authors wish to express their thanks.

**Acknowledgments:** The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.

**Conflicts of Interest:** The authors declare no conflict of interest.

**Data Availability:** The data used to support the findings of this study are available from the corresponding author upon request.

### **Nomenclature**

