1. Introduction
Energy crises and climate change are growing public concerns worldwide, and mass fuel consumption and exhaust gas emissions aggravate both problems. There is therefore an urgent need to develop energy-saving vehicles that help mitigate energy shortages and climate change. Among environmentally friendly vehicles, hybrid electric vehicles (HEVs) offer a longer driving range than pure electric vehicles and lower fuel consumption than conventional vehicles [1]. HEV technology is also the most practical new-energy-vehicle technology for meeting China's 2025 fuel consumption target of 4.0 L per 100 km for passenger cars. In HEVs, however, the energy management system is far more complex than in conventional vehicles and pure electric vehicles. The energy management system (EMS) performs supervisory control of the hybrid powertrain and determines the optimal distribution of energy flow in an HEV so as to satisfy the driver's demand while achieving maximum energy efficiency [2]. Thus, the EMS of HEVs has become a research hotspot in the automotive field.
Energy management for hybridized and electrified powertrains is critically important for the development of future passenger cars, and a considerable amount of literature has been published on energy management strategies. These studies can mainly be divided into three categories: rule-based, optimization-based, and learning-based strategies. Rule-based EMSs depend heavily on the engineer's experience and expertise. The two streams of heuristic EMS, deterministic rule-based and fuzzy logic, are implemented extensively in the automotive field, owing to their structural simplicity and real-time character in practice-oriented applications [3,4]. In particular, fuzzy logic control enables the EMS to handle both numerical data and linguistic knowledge. Nevertheless, due to the non-linearity of hybrid powertrains and the uncertainty introduced by the dynamics of real-world driving environments, their optimality and adaptability across driving conditions cannot usually be guaranteed. Optimization-based EMSs mainly consist of global optimization algorithms and real-time optimization algorithms. Typical global optimization algorithms include dynamic programming (DP) [5,6], the genetic algorithm (GA) [7,8], and particle swarm optimization (PSO) [9,10]; these are computationally expensive and, since they require the full load profile in advance, are usually executed offline as benchmarks for evaluating the effectiveness of online EMSs. Real-time optimal control converts the global optimization problem into an instantaneous one, trading some optimality for online feasibility; representative methods include Pontryagin's minimum principle (PMP) [11,12,13], the equivalent consumption minimization strategy (ECMS) [14,15] and model predictive control (MPC) [16,17]. However, the success of MPC hinges on fast prediction and rapid optimization: road conditions must be predicted in advance, and performance depends heavily on the model. PMP is effective, but the co-state is difficult to obtain and requires considerable computational effort. ECMS has good real-time characteristics, but the historical information used to calculate the equivalent fuel consumption does not necessarily represent future driving conditions, which weakens the robustness of the algorithm.
At present, more state-of-the-art methodologies are being explored by researchers to solve the HEV/PHEV energy management problem in real time with the help of cloud computing and artificial intelligence (AI) [18,19]. Machine learning techniques, especially the reinforcement learning techniques developed in recent years, open up new promise in meeting this challenge, and research into these solutions has been widely reported [20]. A series of predictive EMSs using traditional Q-learning was proposed in [21,22], which can dramatically improve vehicle performance compared with conventional rule-based strategies. Two novel model-free heuristic action execution policies were investigated in [23] for the double Q-learning method, namely the max-value-based policy and the random policy. The proposed double Q-learning strategy reduced the overestimation of the merit-function values for each power-split action, and a hardware-in-the-loop test showed an energy saving of 4.55% under predefined real-world driving conditions compared with standard Q-learning. Such approaches are suitable only for discrete, low-dimensional action spaces, whereas the action spaces of HEV energy management control problems are continuous and high-dimensional. An obvious workaround is to discretize the continuous action space, but discretization has many limitations, the most notable being the curse of dimensionality: the number of actions rises exponentially with the degrees of freedom. Coarse action spaces needlessly throw away structural information of the action domain, which can hinder the solution of many problems, while large action spaces are hard to explore efficiently and make it likely intractable to train Q-learning-like networks successfully. In recent years, deep reinforcement learning (DRL) based on value functions and policies has been applied to the development of intelligent HEV EMSs. The deep deterministic policy gradient (DDPG) strategy is an online actor–critic, model-free, off-policy reinforcement learning method; the DDPG agent computes an optimal policy that maximizes the long-term reward and can be trained in environments with continuous or discrete state spaces and continuous action spaces [24]. Ref. [25] exploited expert knowledge in a DDPG-based EMS for a hybrid electric bus (HEB), which required no discretization of either states or actions, and battery thermal safety and degradation were taken into account in the formulation of the DDPG agent. Unfortunately, the overestimation of Q-values leads to incremental bias and sub-optimal policies, which is a common shortcoming of DDPG.
According to a survey of the literature on model-free DRL agents, extensive research has been carried out on the overall performance metrics of different DRL agents: global optimality, convergence speed, computational efficiency, robustness, and generalization ability. To this end, several techniques have been successively integrated with existing DRL agents; these extensions primarily aim to improve the exploitation and exploration mechanisms. Quan et al. proposed three novel model-free multi-step reinforcement learning strategies (sum-to-terminal, average-to-neighbor and recurrent-to-terminal) to accelerate the learning process of agents; a hardware-in-the-loop test showed that the proposed energy management method could increase the prediction horizon length by 71% and save energy by at least 7.8% under the same driving conditions, compared with a well-designed model-based predictive energy management control policy [26]. Ref. [23] applied a layered topology to the double Q-learning EMS to relieve the computational effort of the onboard controller: the learning layer was deployed on a powerful server computer, the control layer was installed in the onboard controller, and the two layers communicated through the V2X network. The introduction of an optimal brake specific fuel consumption (BSFC) curve and battery characteristics into a DDPG-based EMS accelerated the learning process and achieved better fuel economy [27].
Recently, active investigations of DRL algorithms have yielded fruitful achievements. However, several shortcomings still need to be addressed. Firstly, the existing literature does not sufficiently address the real-time implementation of deep reinforcement learning [28]. The complexity and non-linearity of the EMS in an HEV can reduce the efficiency of the learned policies, and a large number of optimization objectives inevitably leads to state space redundancy, which is not conducive to the real-time implementation of deep reinforcement learning. Secondly, most investigations in this domain have concentrated solely on fuel economy, neglecting the impact of operating conditions on onboard LIB systems [29], even though operating conditions strongly affect the lifetime of LIB systems [20,21,22,23,24,25,26,27,28,29,30,31,32]. Finally, the large number of hyper-parameters in deep reinforcement learning already imposes a heavy tuning burden on researchers and engineers, and the regulation of multi-objective weighting parameters in the reward function aggravates the situation. Moreover, it is difficult to balance multiple optimization objectives through weighting parameters in fast-changing driving scenarios, which degrades the performance of the resulting energy management strategies.
To bridge these gaps, a novel EMS is proposed for a power-split HEV. The present work includes the following three contributions: (1) a Non-Parametric Rewarding TD3 algorithm (NPR-TD3) is proposed to alleviate the burden of weighting parameter tuning; (2) state space refinement techniques are discussed with the aim of increasing the potential for real-time implementation of the TD3 algorithm; (3) battery degradation is taken into account in the proposed EMS to improve the management quality.
The remainder of this paper is organized as follows. Section 2 introduces vehicle modeling of the HEV and the energy management formulation. The NPR-TD3 strategy and the state space refinement techniques are elaborated in Section 3, followed by the discussion of simulation results in Section 4. Conclusions are summarized in Section 5.
3. Energy Management System
3.1. Fundamentals of TD3 Algorithm
The DDPG algorithm is an extension of deep Q-networks (DQNs) that aims to address the curse of dimensionality and handle control tasks with continuous action spaces, such as optimal power allocation for a hybrid electric powertrain [24]. The actor–critic method converts Monte Carlo-based updates into temporal-difference updates to learn a parameterized policy. Meanwhile, by incorporating the target network and experience replay from DQN, the traditionally on-policy actor–critic is converted to an off-policy method, which improves sample efficiency. However, some inherent limitations of DDPG remain unsolved.
Since both DDPG and DQN update the Q-value in the same way, the max operator tends to overestimate the Q-values of some actions, which results in incremental bias and a sub-optimal policy [35]. Furthermore, hyper-parameters may have a direct impact on the stability of network convergence, as DDPG is extremely sensitive to parameter settings. The large number of hyper-parameters is already burdensome, and the weighting parameters of the reward function exacerbate this difficulty. To cope with the aforementioned defects, this work builds on the TD3 algorithm [36], one of the state-of-the-art DRL algorithms, and incorporates a non-parametric reward function to manage the power allocation between the engine and the motors while accounting for battery degradation, thereby improving energy efficiency. TD3 makes use of clipped double Q-learning, delayed policy updates and target policy smoothing to address the overestimation problem of DDPG. The network architecture of TD3 is depicted in Figure 2.
Firstly, TD3 uses two independent Q-value networks to compute the value of the next state, which mimics the idea of double Q-learning:

$Q_{target} = r + \gamma \min_{i=1,2} Q_{\theta_i'}\big(s', \pi_{\phi'}(s')\big)$

where $Q_{\theta_1'}$ and $Q_{\theta_2'}$ are the two target Q-values; $r$ is the reward; $\gamma$ is the discount factor; $s'$ is the state at the next time step; and $\phi'$ is the parameter of the target actor network. TD3 takes the clipped minimum of the two independent Q-values to form the target Q-value, so as to offset the overestimation of the Q-value under the Bellman equation, and calculates the TD-error and loss function as in Equations (14) and (15).
Although this Q-value update rule may introduce an underestimation bias, compared with the standard Q-learning approach, underestimated actions are not explicitly propagated through policy updates.
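The TD-error and critic loss referenced as Equations (14) and (15) are not reproduced in this excerpt; in the standard TD3 formulation they take the following form (a sketch of the usual definitions, not necessarily the paper's exact notation):

$\delta_i = Q_{target} - Q_{\theta_i}(s, a), \qquad i = 1, 2$

$L(\theta_i) = \mathbb{E}\big[\big(Q_{target} - Q_{\theta_i}(s, a)\big)^2\big]$

where $Q_{\theta_i}$ denotes the $i$-th online critic evaluated at the stored state–action pair $(s, a)$.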
Secondly, establishing a target network as a deep function approximator provides a stable target for the learning process and improves convergence. However, states observed with error can easily lead to divergence. Hence, the policy network is updated less frequently than the value network in an attempt to minimize error propagation. The lower the frequency of policy updates, the smaller the update variance of the Q-value function and the higher the quality of the obtained policy.
Thirdly, the calculation of Q-values is smoothed to avoid overfitting, with the aim of resolving the trade-off between bias and variance. Therefore, truncated normally distributed noise is added to each action as a regularization, which results in a modified target action as in Equation (16):

$\tilde{a} = \pi_{\phi'}(s') + \epsilon, \quad \epsilon \sim \mathrm{clip}\big(\mathcal{N}(0, \sigma), -c, c\big)$

where $\tilde{a}$ is the action from the target actor network with noise $\epsilon$ added to the action; $c$ is the bound of the noise, and $\sigma$ is the variance.
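To make the three mechanisms concrete, the following PyTorch-style sketch outlines one TD3 update step under generic network and optimizer objects. Names such as actor, critic1, policy_delay, and the [-1, 1] action normalization are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn.functional as F

def td3_update(step, batch, actor, actor_target, critic1, critic2,
               critic1_target, critic2_target, actor_opt, critic_opt,
               gamma=0.99, sigma=0.2, noise_clip=0.5, policy_delay=2, tau=0.005):
    # One TD3 update step: target policy smoothing, clipped double-Q target,
    # delayed actor update and soft target updates (illustrative sketch only).
    s, a, r, s_next, done = batch  # minibatch of transitions from the replay buffer

    with torch.no_grad():
        # Target policy smoothing (Eq. (16) analogue): clipped Gaussian noise on the target action
        noise = (torch.randn_like(a) * sigma).clamp(-noise_clip, noise_clip)
        a_next = (actor_target(s_next) + noise).clamp(-1.0, 1.0)  # assumes actions normalized to [-1, 1]
        # Clipped double Q-learning: minimum of the two target critics
        q_next = torch.min(critic1_target(s_next, a_next),
                           critic2_target(s_next, a_next))
        y = r + gamma * (1.0 - done) * q_next  # target Q-value

    # Critic loss: mean squared TD-error for both critics (Eqs. (14)-(15) analogue)
    critic_loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed policy update: the actor and all targets are refreshed less frequently
    if step % policy_delay == 0:
        actor_loss = -critic1(s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        # Soft (Polyak) update of target networks
        for net, target in ((actor, actor_target),
                            (critic1, critic1_target),
                            (critic2, critic2_target)):
            for p, tp in zip(net.parameters(), target.parameters()):
                tp.data.mul_(1.0 - tau).add_(tau * p.data)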
3.2. State Space Refinement Techniques
Algorithm simplification is a potential way to improve the real-time implementation performance of DRL, for example when transferring DRL algorithms from the simulation environment to hardware [28]. Algorithm simplification can be achieved by reducing the number of inputs and the complexity of the DRL method (e.g., by refining the state space).
Usually, the states can be roughly classified into direct states and indirect states, depending on the timing of the feedback from the reward function. Direct states are immediately linked to the reward function. In contrast, indirect states are not instantly linked to the reward function, and the events they represent take more time to receive feedback [37]. Driver torque demand, vehicle power demand, vehicle velocity, vehicle acceleration, fuel consumption, battery SOC, battery SOH, and battery current are directly related to the objectives (fuel economy and battery degradation) of the energy management system; they are therefore direct states. Clutch status and transmission state, by contrast, are not immediately reflected in vehicle fuel consumption or battery degradation, so they are indirect states. It is more difficult and less efficient for DRL to establish decision-making correlations from indirect states. Obviously, including all the above signals in the state space inevitably leads to state space redundancy, so state space refinement techniques are necessary.
Firstly, to facilitate the DRL algorithm to reach the optimization goal faster and better, the state space must be designed in accordance with the deep reinforcement learning reward function. Next, direct and indirect states are classified, and only direct states are introduced into the state space. Finally, signals with the same role in the state space are excluded. Vehicle velocity and acceleration have similar roles in the energy management system, so only velocity is introduced into the state space. Vehicle torque, power demand and velocity are closely related, and any one of them can be calculated from the other two signals; therefore, vehicle torque demand and velocity are introduced into the state space. Battery SOC and battery SOH can be derived from battery current, so battery current is introduced into the state space.
From the above analysis, the state space of the proposed EMS is $S = \{v, T_{req}, \dot{m}_f, I_{bat}\}$, where $v$ is the vehicle speed; $T_{req}$ is the driver torque demand; $\dot{m}_f$ represents the engine fuel consumption and $I_{bat}$ is the battery current. The equivalent output torque of the engine is the control action, i.e., $A = \{T_{eq}\}$, where $T_{eq}$ and $T_e$ are the equivalent and actual torque outputs of the engine, respectively; $T_{e,\max}$ and $T_{e,\min}$ are the maximum and minimum output torques of the engine, respectively.
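As a minimal sketch of how the refined state and action could be assembled in practice, the following Python helper uses hypothetical signal names and normalization ranges (the scaling constants and the equivalent-torque mapping are assumptions for illustration, not the paper's calibration):

import numpy as np

def build_state(v_kmh, torque_req_nm, fuel_rate_gps, batt_current_a):
    # Assemble the refined 4-dimensional state vector; scaling values are placeholders.
    return np.array([
        v_kmh / 120.0,           # vehicle speed, scaled by an assumed maximum of 120 km/h
        torque_req_nm / 250.0,   # driver torque demand, scaled by an assumed maximum
        fuel_rate_gps / 3.0,     # instantaneous fuel consumption rate
        batt_current_a / 100.0,  # battery current, scaled by an assumed maximum
    ], dtype=np.float32)

def action_to_engine_torque(a, t_min=0.0, t_max=140.0):
    # Map a normalized agent action a in [-1, 1] to an engine torque command
    # within [t_min, t_max]; this equivalent-torque mapping is an assumption.
    return t_min + 0.5 * (a + 1.0) * (t_max - t_min)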
No matter which sequence of the control actions is selected, the internal combustion engine (ICE), battery, integrated starter generator (ISG), and traction motor should work in a reasonable range. These constraints are defined in Equations (17) and (18):
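The constraint equations themselves are not reproduced in this excerpt; a typical form, consistent with the component limits named above (the exact grouping into Equations (17) and (18) is an assumption), is:

$T_{e,\min} \le T_e(t) \le T_{e,\max}, \quad \omega_{e,\min} \le \omega_e(t) \le \omega_{e,\max}, \quad T_{m,\min} \le T_m(t) \le T_{m,\max}, \quad T_{g,\min} \le T_g(t) \le T_{g,\max}$

$SOC_{\min} \le SOC(t) \le SOC_{\max}, \quad I_{bat,\min} \le I_{bat}(t) \le I_{bat,\max}$

where the subscripts e, m, g and bat denote the engine, traction motor, ISG and battery, respectively.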
3.3. Non-Parametric Reward Function
The three modifications adopted by TD3, namely clipped double Q-learning, delayed policy updates and target policy smoothing, largely mitigate the overestimation bias of DDPG. The reward function is also a powerful and non-negligible lever for improving the performance of deep reinforcement learning [38]. Nevertheless, the TD3 algorithm may suffer from poor exploration performance and adaptability if the weighting parameters of the reward function are set unreasonably. To bridge this gap, this paper proposes a non-parametric reward function.
The goal of the EMS is to compute a control sequence that reduces the cost attributed to fuel consumption and LIB degradation, while maintaining the charge margin. The overall targets of the EMS can be boiled down to the cost function in Equation (19), where $m_f$ is the fuel consumption of the engine (kg) and $C_{SOC}$ denotes the cost related to the SOC deviation. SOC and SOH are the battery state of charge and state of health, respectively; $SOC_{ref}$ is the reference SOC. The SOC of the LIB should be well controlled to ensure enough margin for charging and discharging. Inspired by Equation (19), the reward function is defined alternatively as the negative of the instantaneous cost. It is worth mentioning that all three parts (fuel consumption, SOC deviation and battery degradation) included in the reward function need to be scaled to the same magnitude.
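Equation (19) itself is not reproduced in this excerpt; a generic cost of the kind described, written here only as an illustrative sketch (the weighting structure is an assumption, not the paper's exact formula), is:

$J = \sum_{t=0}^{T} \Big[ \dot{m}_f(t)\,\Delta t + C_{SOC}\big(SOC(t) - SOC_{ref}\big)^2 + C_{SOH}\,\Delta SOH(t) \Big]$

with the per-step reward taken as the negative of the bracketed term after each component has been scaled to a comparable magnitude.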
The settings of the weighting parameters of the reward function in DRL methods are time-consuming and add complexity. Therefore, it is desirable to reduce the number of parameters that need to be adjusted in the reward function. The proposed non-parametric reward function is a generalization of the currently dominant parametric reward functions with multiple weighting parameters [25,39,40], in which the battery SOC plays the role of the weighting parameters. From Equation (19), it can be seen that when the battery is fully charged, with a high SOC, the reward is calculated mainly from the fuel consumption, which means that reducing fuel consumption is the only way to obtain a higher reward. Conversely, when the battery is running at a low charge, with a low SOC, the reward is calculated mainly from the battery power, which implies that reducing the deviation of the battery charge state is the primary way to gain an increased reward. Therefore, the proposed reward function does not require the adjustment of additional parameters. The architecture of the NPR-TD3-based strategy is shown in Figure 3.
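As one plausible reading of this SOC-as-weight mechanism (the blending form, scaling constants and the soh_drop term below are assumptions for illustration, not the paper's Equation (19)), a reward of this kind could be coded as follows:

def npr_reward(fuel_rate_gps, soc, batt_power_kw, soh_drop):
    # Non-parametric reward sketch: the battery SOC itself blends the fuel and
    # charge-sustaining terms instead of hand-tuned weighting parameters.
    fuel_term = fuel_rate_gps / 3.0        # scaled fuel consumption (assumed max ~3 g/s)
    soc_term = abs(batt_power_kw) / 50.0   # scaled battery power (assumed max ~50 kW)
    soh_term = soh_drop * 1e4              # scaled incremental battery degradation
    # High SOC -> the fuel term dominates; low SOC -> the charge-sustaining term dominates.
    cost = soc * fuel_term + (1.0 - soc) * soc_term + soh_term
    return -cost

Under this reading, no external weighting parameter needs tuning: the agent's emphasis shifts automatically between fuel economy and charge sustenance as the SOC evolves.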