Article

Multi-Objective Energy Management Strategy for Hybrid Electric Vehicles Based on TD3 with Non-Parametric Reward Function

1 Hubei Key Laboratory of Advanced Technology for Automotive Components, Wuhan University of Technology, Wuhan 430070, China
2 Hubei Research Center for New Energy & Intelligent Connected Vehicle, Wuhan 430070, China
3 Department of Mechanical Engineering, University of Birmingham, Birmingham B15 2TT, UK
* Author to whom correspondence should be addressed.
Energies 2023, 16(1), 74; https://doi.org/10.3390/en16010074
Submission received: 16 November 2022 / Revised: 12 December 2022 / Accepted: 14 December 2022 / Published: 21 December 2022

Abstract

The energy management system (EMS) of hybridized and electrified powertrains plays a pivotal role in improving the stability and cost-effectiveness of future vehicles. Existing efforts mainly concentrate on specific optimization targets, such as fuel consumption, without sufficiently taking into account the degradation of on-board power sources. In this context, a novel multi-objective energy management strategy based on deep reinforcement learning is proposed for a hybrid electric vehicle (HEV), explicitly accounting for lithium-ion battery (LIB) wear. Specifically, this paper makes three main contributions. Firstly, a non-parametric reward function is introduced, for the first time, into the twin-delayed deep deterministic policy gradient (TD3) strategy, to improve the optimality and adaptability of the proposed energy management strategy and to mitigate the effort of parameter tuning. Then, to cope with the problem of state redundancy, state space refinement techniques are included in the proposed strategy. Finally, battery health is incorporated into this multi-objective energy management strategy. The efficacy of this framework is validated, in terms of training efficiency, optimality and adaptability, under various standard driving tests.

1. Introduction

Energy crises and climate change are growing public concerns worldwide. Large-scale fuel consumption and exhaust emissions aggravate both problems, so there is an urgent need to develop energy-saving vehicles. Among all environmentally friendly vehicles, hybrid electric vehicles (HEVs) enjoy a longer driving range than pure electric vehicles and lower fuel consumption than conventional vehicles [1]. HEV technology is also the most practical new-energy-vehicle technology for meeting China's 2025 fuel consumption target (4.0 L per 100 km for passenger cars). In HEVs, however, the energy management system is far more complex than in conventional vehicles and pure electric vehicles. The energy management system (EMS) is a supervisory control system for the hybrid powertrain, developed to determine the optimal distribution of energy flow in an HEV so as to satisfy the driver's demand and achieve maximum energy efficiency [2]. Thus, the EMS of HEVs has become a research hotspot in the automotive field.
The energy management system of hybridized and electrified powertrains is critically important for the development of future passenger cars, and a considerable amount of literature has been published on energy management strategies. These studies can mainly be divided into three categories: rule-based, optimization-based, and learning-based strategies. Rule-based EMSs depend heavily on the engineer's experience and expertise. The two streams of heuristic EMS, deterministic rule-based and fuzzy logic, are implemented extensively in the automotive field, owing to their structural simplicity and real-time character in practice-oriented applications [3,4]. In particular, fuzzy logic control enables the EMS to handle both numerical data and linguistic knowledge. Nevertheless, due to the non-linearity of hybrid powertrains and the uncertainty introduced by real-world driving environments, their optimality and adaptability to various driving conditions cannot usually be guaranteed. Optimization-based EMSs mainly consist of global optimization algorithms and real-time optimization algorithms. Typical global optimization algorithms include dynamic programming (DP) [5,6], the genetic algorithm (GA) [7,8], and particle swarm optimization (PSO) [9,10]; these are computationally expensive and require the full driving cycle to be known in advance, so they are usually performed offline as a benchmark for evaluating the effectiveness of other online EMSs. Real-time optimal control converts the global optimization problem into an instantaneous one to improve online execution feasibility as a compromise; examples include Pontryagin's minimum principle (PMP) [11,12,13], the equivalent consumption minimization strategy (ECMS) [14,15] and model predictive control (MPC) [16,17]. However, the key to the success of MPC is fast prediction and rapid optimization: road conditions must be predicted in advance, and the result depends greatly on the model. PMP is effective, but its co-state is difficult to obtain and requires large computational effort. ECMS has good real-time characteristics, but the historical information used to calculate the equivalent fuel consumption does not necessarily represent future driving conditions, leading to poor robustness.
At present, more state-of-the-art methodologies are being explored to solve HEV/PHEV energy management problems in real time with the help of cloud computing and artificial intelligence (AI) [18,19]. Machine learning techniques, and especially the reinforcement learning techniques developed in recent years, open up new possibilities for meeting this challenge, and research into these solutions has been widely reported [20]. A series of predictive EMSs was proposed in [21,22] using traditional Q-learning, which can dramatically improve vehicle performance compared with conventional rule-based strategies. Two novel model-free heuristic action execution policies were investigated in [23] for the double Q-learning method, namely the max-value-based policy and the random policy. The proposed double Q-learning strategy reduced the overestimation of the merit-function values for each power-split action, and a hardware-in-the-loop test validated an energy saving of 4.55% in predefined real-world driving conditions, compared with standard Q-learning. Such approaches are only suitable for discrete, low-dimensional action spaces, whereas the action space of the HEV energy management problem is continuous and high-dimensional. The most straightforward workaround is to discretize the continuous action space, but discretization has many limitations, the curse of dimensionality being the most notable: the number of actions rises exponentially with the degrees of freedom. Coarse discretization needlessly throws away structural information of the action domain, which may hinder many problems, while fine discretization is hard to explore efficiently and makes it likely intractable to train Q-learning-like networks successfully. In recent years, deep reinforcement learning (DRL) based on value functions and policies has been applied to the development of intelligent HEV EMSs. The deep deterministic policy gradient (DDPG) strategy is an online actor-critic, model-free, off-policy reinforcement learning strategy; the DDPG agent computes an optimal policy that maximizes the long-term reward and can be trained in environments with continuous or discrete state spaces and continuous action spaces [24]. Ref. [25] exploited expert knowledge in a DDPG-based EMS for a hybrid electric bus (HEB), which required no discretization of either states or actions, and battery thermal safety and degradation were taken into account when formulating the DDPG agent. Unfortunately, the overestimation of Q-values leads to incremental bias and sub-optimal policies, which is a common shortcoming of DDPG.
According to a survey of the literature on model-free DRL agents, extensive research has been carried out on the overall performance metrics of different DRL agents: global optimality, convergence speed, computational efficiency, robustness, and generalization ability. To this end, several techniques have been integrated with existing DRL agents, primarily to improve the exploitation and exploration mechanism. Quan et al. proposed three novel model-free multi-step reinforcement learning strategies (sum-to-terminal, average-to-neighbor and recurrent-to-terminal) to accelerate the learning process of agents. Moreover, a hardware-in-the-loop test showed that the proposed energy management method could increase the prediction horizon length by 71% and save at least 7.8% of energy under the same driving conditions, compared with a well-designed model-based predictive energy management control policy [26]. Ref. [23] applied a layered topology to the double Q-learning EMS to relieve the computational effort of the onboard controller: the learning layer was deployed on a powerful server computer, the control layer was installed in the onboard controller, and the two layers communicated through the V2X network. The introduction of the optimal brake-specific fuel consumption (BSFC) curve and battery characteristics into a DDPG-based EMS accelerated the learning process and achieved better fuel economy [27].
Recently, active investigations of DRL algorithms have yielded fruitful achievements. However, several shortcomings remain. Firstly, the existing literature does not sufficiently address the real-time implementation of deep reinforcement learning [28]. The complexity and non-linearity of the HEV energy management problem can reduce the efficiency of the learned policies, and multiple optimization objectives inevitably lead to state space redundancy, which is not conducive to real-time implementation. Next, most investigations in this domain have concentrated solely on fuel economy, neglecting the impact of operating conditions on onboard LIB systems [29], even though operating conditions deeply affect the lifetime of LIB systems [20,21,22,23,24,25,26,27,28,29,30,31,32]. Finally, the large number of hyper-parameters in deep reinforcement learning already imposes a heavy tuning burden on researchers and engineers, and the regulation of multi-objective weighting parameters in the reward function aggravates the situation. Moreover, it is not easy to balance multiple optimization objectives with fixed weighting parameters in fast-changing driving scenarios, causing the performance of energy management strategies to degrade.
To bridge these gaps, a novel EMS is proposed for a power-split HEV, and the present work includes the following three contributions: (1) a Non-Parametric Rewarding TD3 algorithm (NPR-TD3) is proposed to alleviate the burden of weighting-parameter tuning; (2) state space refinement techniques are discussed to increase the potential for real-time implementation of the TD3 algorithm; (3) battery degradation is taken into account in the proposed EMS to improve the management quality.
The remainder of this paper is organized as follows. Section 2 introduces the vehicle modeling of the HEV and the energy management formulation. The NPR-TD3 strategy and state space refinement techniques are elaborated in Section 3, followed by a discussion of the simulation results in Section 4. Conclusions are summarized in Section 5.

2. Vehicle Modeling and Energy Management Formulation

Appropriate powertrain architecture, component sizing and optimal control strategy are beneficial to the fuel economy of HEVs. This involves heterogeneous and disparate technologies at the infrastructural level, vehicle level, and subsystem level. Vehicle dynamics modeling can be at low frequency or high frequency, depending on the purpose. As this paper mainly focuses on multi-objective optimization of the HEV, a quasi-static pre-transmission parallel HEV model is adopted and battery thermal and degradation models are well established.

2.1. Vehicle Structure

The schematic of the pre-transmission parallel heavy-duty HEV model is shown in Figure 1, and it is simulated in Matlab/Simulink. The powertrain must fulfil road power demand to deliver the required tractive force. This can be calculated from the longitudinal dynamic equation:
F(t) = \frac{1}{2} C_D A \rho v(t)^2 + m g f + \delta m a(t)   (1)
where C_D is the coefficient of air resistance, \rho is the air density, g is the gravitational acceleration, f is the rolling resistance coefficient, \delta is the correction coefficient of rotating mass, and A, m, v(t) and a(t) are the frontal area, total mass, longitudinal velocity and acceleration, respectively; t denotes time over the driving cycle.
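As a quick illustration of Equation (1), the following Python sketch evaluates the road-load force and the corresponding wheel power; the drag, rolling-resistance and rotating-mass coefficients are illustrative assumptions, while the mass and frontal area are taken from Table 1.

```python
# Minimal sketch of the longitudinal road-load demand, Equation (1).
C_D = 0.32        # aerodynamic drag coefficient (assumed)
rho = 1.2         # air density, kg/m^3
A = 2.2           # frontal area, m^2 (Table 1)
m = 1730.0        # vehicle mass, kg (Table 1)
g = 9.81          # gravitational acceleration, m/s^2
f = 0.012         # rolling resistance coefficient (assumed)
delta = 1.05      # rotating-mass correction coefficient (assumed)

def tractive_force(v, a):
    """Road-load force F = 0.5*C_D*A*rho*v^2 + m*g*f + delta*m*a."""
    return 0.5 * C_D * A * rho * v**2 + m * g * f + delta * m * a

# Example: 60 km/h cruise with a mild acceleration of 0.5 m/s^2
v, a = 60 / 3.6, 0.5
F = tractive_force(v, a)
print(f"F = {F:.1f} N, wheel power = {F * v / 1000:.1f} kW")
```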
The engine, connected in parallel with the electric motor, can be engaged or disengaged from wheels through a clutch. The vehicle mainly operates in electric mode, parallel mode with neutral gear and parallel mode depending on the clutch status and gear position [33].
In the electric mode, the clutch of the vehicle is open and only the battery and electric motor are used for propulsion. The engine is switched off and not connected to the wheels. As only one propulsion plant is available in this mode, the torque/power required by the driver at the wheels can be met entirely by the electric drivetrain and no optimization is required. The equation for the instantaneous torque/power balance is shown below:
\begin{cases} T_{mot}(t) = T_{gb}(t) \\ P_{batt}(t) = P_{mot}^{e}(t) + P_{acc}^{e} \\ \omega_{mot}(t) = \omega_{gb}(t) \end{cases} \quad t \in [0, T]   (2)
where T_{gb}(t) and \omega_{gb}(t) are the transient torque and speed of the gearbox; P_{batt}(t) is the battery power; T_{mot}(t) and \omega_{mot}(t) are the transient torque and speed of the electric motor; the powers of the electrical accessory and the electric motor are represented by P_{acc}^{e} and P_{mot}^{e}, respectively. The value P_{acc}^{e} is assumed to be constant. Moreover:
P_{mot}^{m} = \begin{cases} \eta_{mot} P_{mot}^{e}, & P_{mot}^{e} > 0 \\ \dfrac{1}{\eta_{mot}} P_{mot}^{e}, & P_{mot}^{e} < 0 \end{cases}   (3)
where the mechanical power and efficiency of the electric motor are represented by P_{mot}^{m}(t) and \eta_{mot}, respectively.
When the vehicle is at a standstill with the clutch closed and the transmission in the neutral position, the vehicle operates in parallel mode with neutral gear; the engine remains connected to the transmission but can alter its speed freely. The equations for the balance of torque, power and speed are listed below:
\begin{cases} T_{ice}(t) - T_{acc}^{m}(t) = T_{mot}(t) \\ P_{batt}(t) = P_{mot}^{e}(t) + P_{acc}^{e} \\ \omega_{mot}(t) = \omega_{ice}(t) \end{cases} \quad t \in [0, T]   (4)
where T_{ice}(t) and \omega_{ice}(t) stand for the engine torque and speed, and T_{acc}^{m}(t) denotes the torque of the mechanical accessory. In this mode, the total power demand at the wheels is zero and the power balance is given by Equation (5):
P_{gb}(t) = 0 = P_{ice}(t) - P_{acc}^{m}(t) + \frac{1}{\eta_{mot}}\left(P_{batt}(t) - P_{acc}^{e}\right)   (5)
where P_{gb}(t) is the power of the gearbox, P_{acc}^{m} is the power of the mechanical accessory, and P_{ice} is the power of the engine. In parallel mode, the motor and engine operate simultaneously to propel the vehicle: the clutch is closed and the engine is connected to the wheels, so the speed at the wheels determines the speeds of the motor and engine via the transmission. The powertrain equations are shown in Equation (6):
\begin{cases} T_{ice}(t) - T_{acc}^{m} + T_{mot}(t) = T_{gb}(t) \\ P_{batt}(t) = P_{mot}^{e}(t) + P_{acc}^{e} \\ \omega_{mot}(t) = \omega_{ice}(t) = \omega_{gb}(t) \end{cases} \quad t \in [0, T]   (6)
Hence, the total power balance can be expressed as Equation (7):
P_{gb} = \begin{cases} P_{ice} - P_{acc}^{m} + \eta_{mot}\left(P_{batt} - P_{acc}^{e}\right), & P_{mot}^{e} > 0 \\ P_{ice} - P_{acc}^{m} + \dfrac{1}{\eta_{mot}}\left(P_{batt} - P_{acc}^{e}\right), & P_{mot}^{e} < 0 \end{cases}   (7)
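To illustrate how Equation (7) switches between the motoring and generating branches, the following minimal Python sketch evaluates the gearbox power in parallel mode; the accessory powers and motor efficiency are assumed placeholder values, not parameters of the studied vehicle.

```python
def gearbox_power_parallel(P_ice, P_batt, P_mot_e,
                           P_acc_m=300.0, P_acc_e=300.0, eta_mot=0.9):
    """Total gearbox power in parallel mode, Equation (7), in W.

    P_mot_e > 0: the motor draws electric power and adds eta_mot*(P_batt - P_acc_e)
    of mechanical power; P_mot_e < 0: the motor generates, and the mechanical side
    supplies (P_batt - P_acc_e)/eta_mot.
    """
    if P_mot_e > 0:
        return P_ice - P_acc_m + eta_mot * (P_batt - P_acc_e)
    return P_ice - P_acc_m + (P_batt - P_acc_e) / eta_mot

# Example: engine supplies 20 kW while the battery discharges 8 kW to the motor
print(gearbox_power_parallel(P_ice=20e3, P_batt=8e3, P_mot_e=7e3))
```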
The main characteristics of the vehicle are shown in Table 1.

2.2. Battery Thermal and Health Model

Battery performance, life span and safety are strongly correlated with the battery internal temperature. The battery thermal model is based on the lumped thermal mass approach:
\frac{dT_{emp}}{dt} = \frac{I\left(OCV - V\right)}{m c} - \frac{h A}{m c}\left(T_{emp} - T_{amb}\right)   (8)
where T_{emp} is the battery temperature (K); T_{amb} is the ambient temperature (K); m is the battery mass (kg); c is the specific heat capacity (J/(kg·K)); I is the battery cell current (A); OCV is the battery open-circuit voltage (V); V is the battery operating voltage (V); h is the natural heat convection constant (W/(m²·K)); and A is the battery module surface area (m²).
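A minimal sketch of a forward-Euler integration step of Equation (8) is shown below; the cell mass, heat capacity, convection constant and surface area are placeholder values for illustration, not the parameters of the studied pack.

```python
def battery_temp_step(T, I, OCV, V, dt,
                      m_cell=0.8, c_p=1000.0, h=5.0, A_surf=0.05, T_amb=293.15):
    """One forward-Euler step of the lumped thermal model, Equation (8).

    dT/dt = I*(OCV - V)/(m*c) - h*A*(T - T_amb)/(m*c)
    All thermal parameters here are illustrative assumptions.
    """
    dTdt = (I * (OCV - V)) / (m_cell * c_p) - (h * A_surf / (m_cell * c_p)) * (T - T_amb)
    return T + dTdt * dt

# Example: 1 s step at 50 A discharge with a 0.1 V over-potential
print(battery_temp_step(T=298.15, I=50.0, OCV=3.7, V=3.6, dt=1.0))
```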
The state of health (SOH), used to describe the physical condition of the battery, is commonly characterized by a system parameter correlated with its ageing; in most applications, the SOH is tied to the performance requirement. The control-oriented state-of-health model proposed in [34] was used to predict battery degradation. It was assumed that the LIB system could withstand a specific cumulative charge flow before reaching its end-of-life (EOL). The dynamic expression for SOH is:
\frac{dSOH(t)}{dt} = -\frac{\left|I(t)\right|}{2\,N\!\left(c_r, T_{emp}\right) C_n}   (9)
where N(c_r, T_{emp}) is the equivalent number of cycles until EOL, which accounts for the effect of the C-rate (c_r) and the internal temperature on cycle life, and C_n is the nominal capacity of the battery. N(c_r, T_{emp}) is calculated by Equation (10):
N\!\left(c_r, T_{emp}\right) = \frac{3600\,Ah\!\left(c_r, T_{emp}\right)}{C_n}   (10)
where B is the pre-factor given in Table 2; R = 8.314 J/(mol·K) is the ideal gas constant; z = 0.55 is the power-law factor; Ah is the ampere-hour throughput; and E_a is the activation energy, derived from [33]:
E_a\!\left(c_r\right) = 31700 - 370.3\,c_r   (11)
The LIB reaches the EOL when the capacity has faded by 20%, subject to which the capacity loss and the ampere-hour throughput Ah are given by Equation (12):
\begin{cases} \Delta C_n = B\!\left(c_r\right) \exp\!\left(\dfrac{-E_a\!\left(c_r\right)}{R\,T_{emp}}\right) Ah\!\left(c_r, T_{emp}\right)^{z} \\ Ah\!\left(c_r, T_{emp}\right) = \left[\dfrac{20}{B\!\left(c_r\right)\exp\!\left(-E_a/\left(R\,T_{emp}\right)\right)}\right]^{1/z} \end{cases}   (12)
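The sketch below shows how Equations (9) through (12) could be evaluated together; the nominal capacity, the interpolation of B(c_r) between the Table 2 points, and the per-second unit conversion are assumptions made only for illustration.

```python
import numpy as np

R_GAS = 8.314   # ideal gas constant, J/(mol*K)
Z = 0.55        # power-law factor
# Pre-factor B(c_r) from Table 2, interpolated between the tabulated C-rates (assumption).
C_RATES = np.array([0.5, 2.0, 6.0, 10.0])
B_VALUES = np.array([31630.0, 21681.0, 12934.0, 15512.0])

def soh_derivative(I, c_rate, T_emp, C_n=40.0):
    """Instantaneous SOH fade rate from Equations (9)-(12).

    C_n is the nominal capacity in Ah (placeholder value).
    """
    E_a = 31700.0 - 370.3 * c_rate                                        # Equation (11)
    B = float(np.interp(c_rate, C_RATES, B_VALUES))
    Ah_eol = (20.0 / (B * np.exp(-E_a / (R_GAS * T_emp)))) ** (1.0 / Z)   # Equation (12)
    N_cycles = 3600.0 * Ah_eol / C_n                                      # Equation (10)
    # Equation (9); the extra factor 3600 converts C_n from Ah to As (unit assumption).
    return -abs(I) / (2.0 * N_cycles * C_n * 3600.0)

# Example: 2C discharge of a 40 Ah pack at 25 degC
print(soh_derivative(I=80.0, c_rate=2.0, T_emp=298.15))
```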

3. Energy Management System

3.1. Fundamentals of TD3 Algorithm

The DDPG algorithm is an extension of deep Q-networks (DQNs) that aims to address the curse of dimensionality and handle control tasks with continuous action spaces, such as optimal power allocation for a hybrid electric powertrain [24]. The actor-critic method converts Monte Carlo-based updates into temporal-difference updates to learn a parameterized policy. Meanwhile, by incorporating target networks and experience replay from DQN, the traditional on-policy actor-critic method is converted to an off-policy one, which improves sample efficiency. However, some inherent limitations of DDPG remain unsolved.
Since both DDPG and DQN update the Q-value in the same way, the max operator in the target y = r + \gamma \max_{a'} Q\left(s', a'|\theta^{Q}\right) tends to overestimate the Q-values of some actions, which results in incremental bias and a sub-optimal policy [35]. Furthermore, hyper-parameters have a direct impact on the stability of network convergence, as DDPG is extremely sensitive to their settings. The large number of hyper-parameters already imposes a heavy tuning burden, and the weighting parameters of the reward function exacerbate it. To cope with the aforementioned defects, this work builds on the TD3 algorithm [36], one of the state-of-the-art DRL algorithms, and incorporates a non-parametric reward function to manage the power allocation between the engine and the motor while accounting for battery degradation, so as to improve energy efficiency. TD3 makes use of clipped double Q-learning, delayed policy updates and target policy smoothing to address the overestimation problem of DDPG. The network architecture of TD3 is depicted in Figure 2.
Firstly, TD3 uses two independent Q-value networks to compute the value of the next state, which mimics the idea of double Q-learning:
\begin{cases} y_1 = r + \gamma Q_{\theta_1'}\!\left(s', \mu\!\left(s'|\theta^{\mu'}\right)\right) \\ y_2 = r + \gamma Q_{\theta_2'}\!\left(s', \mu\!\left(s'|\theta^{\mu'}\right)\right) \end{cases}   (13)
where y_1 and y_2 are the two Q-value targets; r is the reward; \gamma is the discount factor; s' is the state at the next time step; and \mu(s'|\theta^{\mu'}) is the action given by the target actor network. TD3 takes the clipped minimum of the two independent Q-values to form the target Q-value, so as to offset the overestimation of the Q-value under the Bellman equation, and computes the TD error and loss function as in Equations (14) and (15).
y = r + \gamma \min_{i=1,2} Q_{\theta_i'}\!\left(s', \tilde{a}\right)   (14)
L_i = \frac{1}{M}\sum_{j=1}^{M}\left(y_j - Q_{\theta_i}\!\left(s_j, a_j\right)\right)^2   (15)
In spite of the fact that this Q-value update rule may generate an underestimation bias, compared to the standard Q-learning approach, the underestimated actions are not propagated explicitly through policy updates.
Secondly, using target networks as deep function approximators provides a stable target for the learning process and improves convergence. Yet, policy updates based on value estimates that still contain errors can easily lead to divergence. Hence, the policy network is updated less frequently than the value network, in an attempt to limit error propagation: the lower the frequency of policy updates, the smaller the update variance of the Q-value function and the higher the quality of the obtained policy.
Thirdly, the calculation of the target Q-value is smoothed to avoid overfitting, with the aim of resolving the trade-off between bias and variance. Therefore, a clipped, normally distributed noise is added to the target action as a regularization, which results in the modified target update of Equation (16).
\tilde{a} \leftarrow \mu\!\left(s'|\theta^{\mu'}\right) + \varepsilon, \quad \varepsilon \sim \mathrm{clip}\!\left(\mathcal{N}(0, \sigma), -c, c\right), \quad c > 0   (16)
where \tilde{a} is the action from the target actor network with noise \varepsilon added, c is the noise bound, and \sigma is the variance of the noise.
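To make the update rules of Equations (13) to (16) concrete, the following PyTorch-style sketch computes the clipped double-Q target with target policy smoothing; the network objects, the replay-batch layout and the hyper-parameter values are assumptions for illustration, not the settings used in this paper. Delayed policy updates simply mean that the actor is updated once every few critic updates.

```python
import torch

def td3_target(batch, actor_target, critic1_target, critic2_target,
               gamma=0.99, sigma=0.2, c=0.5, act_limit=1.0):
    """Clipped double-Q target with target policy smoothing, Equations (13), (14), (16).

    A minimal sketch under assumed interfaces: `batch` holds tensors "s_next",
    "r", "done"; the target networks are callables returning tensors.
    """
    s_next, r, done = batch["s_next"], batch["r"], batch["done"]
    with torch.no_grad():
        # Target policy smoothing: add clipped Gaussian noise to the target action, Eq. (16).
        a_target = actor_target(s_next)
        noise = torch.clamp(sigma * torch.randn_like(a_target), -c, c)
        a_next = torch.clamp(a_target + noise, -act_limit, act_limit)
        # Clipped double Q-learning: take the minimum of the two target critics, Eq. (14).
        q1 = critic1_target(s_next, a_next)
        q2 = critic2_target(s_next, a_next)
        y = r + gamma * (1.0 - done) * torch.min(q1, q2)
    return y

# The critic loss of Equation (15) is then the mean-squared error between y and
# each critic's estimate, e.g. torch.nn.functional.mse_loss(critic1(s, a), y).
```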

3.2. State Space Refinement Techniques

Algorithm simplification is a potential solution to improve the DRL real-time implementation performance, like converting DRL algorithms from simulation environment to hardware exploration [28]. Algorithm simplification can be achieved by reducing the number of inputs and DRL method complexity (e.g., refinement of state space).
Usually, the states can be roughly classified into direct states and indirect states, depending on the timing of the feedback from the reward function. Direct states are immediately linked to the reward function; indirect states are not, and the events they represent take more time to receive feedback [37]. Driver torque demand, vehicle power demand, vehicle velocity, vehicle acceleration, fuel consumption, battery SOC, battery SOH, and battery current are directly related to the objectives (fuel economy and battery degradation) of the energy management system; therefore, they are direct states. In contrast, clutch status and transmission gear are not immediately reflected in vehicle fuel consumption or battery degradation, so they are indirect states. It is more difficult and less efficient for DRL to establish decision-making correlations using indirect states. Obviously, including all of the above signals in the state space inevitably leads to state space redundancy, so state space refinement techniques are necessary.
Firstly, to facilitate the DRL algorithm to reach the optimization goal faster and better, the state space must be designed in accordance with the deep reinforcement learning reward function. Next, direct and indirect states are classified, and only direct states are introduced into the state space. Finally, signals with the same role in the state space are excluded. Vehicle velocity and acceleration have similar roles in the energy management system, so only velocity is introduced into the state space. Vehicle torque, power demand and velocity are closely related, and any one of them can be calculated from the other two signals; therefore, vehicle torque demand and velocity are introduced into the state space. Battery SOC and battery SOH can be derived from battery current, so battery current is introduced into the state space.
From the above analysis, the state space of the proposed EMS is S = \{V_{spd}, T_{rqdmd}, m_f, I_{batt}\}, where V_{spd} is the vehicle speed, T_{rqdmd} is the driver torque demand, m_f represents the engine fuel consumption and I_{batt} is the battery current. The equivalent output torque of the engine, T_{ice}^{eq} = \dfrac{T_{ice} - T_{ice}^{\min}}{T_{ice}^{\max} - T_{ice}^{\min}}, is the control action, where T_{ice}^{eq} and T_{ice} are the equivalent (normalized) and actual torque outputs, respectively, and T_{ice}^{\max} and T_{ice}^{\min} are the maximum and minimum output torques of the engine, respectively.
No matter which sequence of the control actions is selected, the internal combustion engine (ICE), battery, integrated starter generator (ISG), and traction motor should work in a reasonable range. These constraints are defined in Equations (17) and (18):
SOC_{\min} \leq SOC \leq SOC_{\max}   (17)
\begin{cases} \omega_x^{\min} \leq \omega_x \leq \omega_x^{\max} \\ T_x^{\min} \leq T_x \leq T_x^{\max} \end{cases}, \quad x = m, g, e   (18)
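A minimal sketch of how the refined state vector and the normalized engine-torque action of this section could be assembled is given below; the assumption that the agent's output lies in [0, 1] and the torque limits in the example are illustrative placeholders, not values from the paper.

```python
import numpy as np

def build_state(v_spd, trq_dmd, m_f, i_batt):
    """Refined four-dimensional state S = {V_spd, Trq_dmd, m_f, I_batt}."""
    return np.array([v_spd, trq_dmd, m_f, i_batt], dtype=np.float32)

def denormalize_action(T_eq, T_ice_min, T_ice_max):
    """Map the equivalent torque (assumed in [0, 1]) back to an engine torque
    and clip it to the physical limits of Equation (18)."""
    T_ice = T_ice_min + T_eq * (T_ice_max - T_ice_min)
    return float(np.clip(T_ice, T_ice_min, T_ice_max))

# Example: the agent requests 60% of the available torque range (limits are placeholders)
s = build_state(v_spd=16.7, trq_dmd=120.0, m_f=0.8, i_batt=-15.0)
print(s, denormalize_action(0.6, T_ice_min=0.0, T_ice_max=180.0))
```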

3.3. Non-Parametric Reward Function

The three modifications adopted by TD3, namely clipped double Q-learning, delayed policy updates and target policy smoothing, largely mitigate the overestimation bias of DDPG. The reward function, however, is a powerful and non-negligible lever for improving the performance of deep reinforcement learning [38], and the TD3 algorithm may still suffer from poor exploration performance and adaptability if the weighting parameters of the reward function are set unreasonably. To bridge this gap, this paper proposes a non-parametric reward function.
The goal of EMS is to compute a control sequence to reduce the cost attributed to fuel consumption and LIB degradation, while maintaining the charge margin. The overall targets of EMS can be boiled down to the following cost function:
J = \left[m_f \cdot SOC + C_{soc} \cdot \left(1 - SOC\right)\right] \cdot \left(1 - SOH\right)   (19)
C_{soc} = \left(SOC_{ref} - SOC\right)^2   (20)
where m_f is the fuel consumption of the engine (kg) and C_{soc} denotes the cost related to the SOC deviation. SOC and SOH are the battery state of charge and state of health, respectively, and SOC_{ref} = 0.625 is the SOC reference; the SOC of the LIB should be well controlled to ensure enough margin for charging/discharging. Inspired by Equation (19), the reward function is defined as R = -J. It is worth mentioning that all three parts (m_f \cdot SOC, C_{soc} \cdot (1 - SOC), and (1 - SOH)) included in the reward function need to be scaled to the same order of magnitude.
The settings of the weighting parameters of the reward function in a DRL method are time-consuming and add complexity, so the number of parameters that need to be adjusted in the reward function should be reduced. The non-parametric reward function is a generalization of the currently dominant parametric reward functions with multiple weighting parameters [25,39,40], in which the battery SOC itself plays the role of the weighting parameter. From Equation (19), it can be seen that when the battery is fully charged, with high SOC, the reward is mainly determined by the fuel consumption, which means that reducing fuel consumption is the only way to obtain a higher reward. On the contrary, when the battery is running at low charge, with low SOC, the reward is mainly determined by the SOC-deviation cost, which implicitly shows that reducing the deviation of the battery state of charge is the primary way to gain an increased reward. Therefore, the proposed reward function does not require the adjustment of additional parameters. The architecture of the NPR-TD3-based strategy is shown in Figure 3.
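The following minimal Python sketch illustrates the non-parametric reward built from Equations (19) and (20); the scale factors k_fuel, k_soc and k_soh are hypothetical placeholders standing in for the magnitude scaling mentioned above, not tuned weights.

```python
def npr_reward(m_f, soc, soh, soc_ref=0.625,
               k_fuel=1.0, k_soc=1.0, k_soh=1.0):
    """Non-parametric reward R = -J, from Equations (19) and (20).

    The SOC itself acts as the weighting between the fuel term and the
    SOC-deviation term; the k_* factors only bring the three parts to the
    same order of magnitude (placeholder values, not tuned weights).
    """
    c_soc = (soc_ref - soc) ** 2                                  # Equation (20)
    J = (k_fuel * m_f * soc + k_soc * c_soc * (1.0 - soc)) * (k_soh * (1.0 - soh))
    return -J

# High SOC: the fuel term dominates; low SOC: the SOC-deviation term dominates.
print(npr_reward(m_f=0.5, soc=0.80, soh=0.98), npr_reward(m_f=0.5, soc=0.45, soh=0.98))
```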

4. Results and Discussion

4.1. Validation Condition and Method Comparison

The proposed NPR-TD3 strategy was trained on two New European Driving Cycle (NEDC) driving cycles, and the testing driving cycle consisted of the Highway Fuel Economy Test (HWFET), the Urban Dynamometer Driving Schedule (UDDS), the Worldwide Light Vehicle Test Procedure (WLTP) Class 3, and the China Light-duty Vehicle Test Cycle (CLTC), in an effort to cover all typical driving scenarios, including urban, suburban, and highway. The validating driving scenario is explicitly more complicated and has more road features than the training set, so that the generality of the trained policy can be evaluated fairly. The training and validating driving cycles are shown in Figure 4a,b, respectively.
To thoroughly evaluate the performance of the proposed strategy, it was cross-validated against two additional TD3-based baseline strategies. The methods involved differ mainly in state space and reward function; their characteristics are summarized comparatively in Table 3. It should be noted that Baseline II [40] represents the cutting-edge techniques of recently reported energy management research, whereas Baseline I and the proposed strategy are both enhanced methods built on this recently reported TD3 baseline (Baseline II).

4.2. Results of Training

The variation of the average reward and fuel consumption with training episodes under the different strategies is illustrated comparatively in Figure 5 and Figure 6. The mature average rewards and fuel consumption (rewards and fuel consumption at the 100th episode) are listed in Table 4. Note that the training was repeated several times and the results were averaged to avoid contingencies. The results indicate that the proposed strategy not only outperformed Baseline I and II in convergence speed but also achieved a larger mature average reward, demonstrating advantages in learning efficiency and optimality. These advantages are attributed to the refined state space and the non-parametric reward function, which boost the efficiency of the TD3 agent and reduce the computational resource requirements. The above comparative verification shows that the proposed strategy improves both convergence efficiency and optimality, and therefore holds great promise for practical applications.

4.3. Validation of Energy Efficiency and Degradation

This section further validated the proposed strategy in regard to energy efficiency. The comparative performances of different strategies, in terms of fuel consumption and battery SOC retention, are shown in Figure 7. The performances of various EMSs are summarized in Table 5. It is clear from the results that the terminal SOC of the battery was almost equal to the initial SOC, so the fuel-saving rate, compared with the baseline strategies, was calculated as follows:
\Delta_{B-N} = \frac{F_B - F_{NPR}}{F_B}   (21)
where \Delta_{B-N} is the fuel-saving rate of the NPR-TD3 strategy compared with a baseline strategy, and F_B and F_{NPR} are the total fuel consumptions of the vehicle operated with the baseline strategy and the NPR-TD3 strategy, respectively.
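As a quick numerical check, the fuel-saving rates reported below follow directly from Equation (21) applied to the training-condition fuel figures in Table 5:

```python
# Worked check of Equation (21) with the Table 5 fuel figures (training condition).
F_NPR, F_B1, F_B2 = 525.43, 545.41, 574.74   # total fuel consumption, g

saving_vs_b1 = (F_B1 - F_NPR) / F_B1
saving_vs_b2 = (F_B2 - F_NPR) / F_B2
print(f"vs. Baseline I:  {saving_vs_b1:.2%}")   # ~3.66%
print(f"vs. Baseline II: {saving_vs_b2:.2%}")   # ~8.58%
```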
According to Equation (21), the energy performance of the proposed NPR-TD3 strategy was 3.66% and 8.58% better than that of the Baseline I and II strategies, respectively, in the training condition. From Figure 7b, it can be seen that the SOC trajectories of all three EMSs stayed within a reasonable range around the initial SOC value, which guaranteed that the battery was neither overcharged nor over-discharged. In the "cold start" (0-850 s) and high-power-demand (850-1250 s and 2100-2500 s) phases, all three strategies had difficulty keeping the real-time SOC near the reference value. However, the NPR-TD3 strategy had the smoothest SOC trajectory for the period from 1250 s to 2100 s, which indicates that it was more successful than the Baseline I and Baseline II strategies in suppressing the deviation of the SOC from its reference value. Overall, the NPR-TD3 strategy suppressed rapid charging and discharging of the battery, which helped reduce the degradation rate of the lithium-ion battery. The battery degradation control of the NPR-TD3 strategy was further validated in the testing conditions.
To further validate the optimality and adaptability of the NPR-TD3 control policy, the proposed strategy was evaluated in the more complicated testing conditions (Figure 4b). The optimization results are shown in Figure 8, and the detailed numerical values are listed in Table 6.
The fuel consumptions were 2158.60 g, 2263.28 g, and 2459.85 g for the NPR-TD3, Baseline I and Baseline II strategies, respectively. The results reveal that the energy performance of the NPR-TD3 strategy was 4.63% and 12.25% superior to those of the Baseline I and Baseline II strategies, as calculated with Equation (21). The battery SOC curves are shown in Figure 8b: all three EMSs had similar SOC trajectories and kept the SOC within a reasonable range. In terms of SOC deviation, the Baseline I strategy tended to charge the battery, with a maximum SOC value close to 80%, while the Baseline II strategy tended to discharge it, with a minimum SOC value below 30%. The SOC curve of the NPR-TD3 strategy lay between the other two and was more successful in mitigating the deviation of the SOC from its reference value. From the above analysis, it can be seen that the proposed strategy achieved better fuel economy under the more complicated compound (testing) conditions, which reflects its advantage in adaptability.
The proposed strategy was also capable of alleviating battery life decay, which can be observed in Figure 9. The speed profile for the entire test-driving condition is displayed in Figure 4b with a total mileage of 66.24 km and an ambient temperature of 20 °C. The trajectory of the battery SOH change is shown in Figure 9a. The proposed strategy guaranteed the slowest battery aging. The Baseline II strategy, which represents the parametric reward function TD3 strategy, achieved a battery aging rate of about twice that of the proposed strategy.
The SOH degradation with respect to the surrounding temperature is illustrated in Figure 9b. The degradation severity of the various strategies was consistent with the battery-temperature results, with accelerated degradation observed at elevated surrounding temperatures. In contrast, the strategy proposed in this paper suppressed battery aging over a wide window of surrounding temperatures, and its advantages became even more pronounced under high-temperature conditions.

5. Conclusions

A novel battery-degradation-constrained energy management strategy is proposed in this paper for the HEV. The TD3 algorithm is combined with state space refinement techniques and a non-parametric reward function to address the problems of state redundancy and the difficulty of setting the weighting parameters reasonably. The proposed strategy was evaluated against cutting-edge techniques in three aspects: training effort, fuel economy, and battery degradation suppression. The major findings are as follows:
(1) State redundancy is a major roadblock to the real-time implementation of DRL strategies, and state refinement techniques are a very promising approach to improve the learning efficiency of DRL strategies. In addition, the state space must be designed in accordance with the reward function.
(2) The non-parametric reward function is able to cope with rapidly changing scenarios, which improves the optimality and adaptability of the proposed strategy by 12.25% compared with the parametric counterpart.
(3) The proposed strategy compresses the degradation rate of battery SOH to about 50% of the degradation rate of the Baseline II strategy.
In this work, all validations were performed on standard driving cycles without traffic information. In future planned work, the proposed strategy will incorporate the effect of traffic information on the energy performance of hybrid electric vehicles.

Author Contributions

Methodology, J.W.; Validation, J.W.; Data curation, J.W. and M.H.; Writing—original draft, J.W.; Writing—review & editing, F.Y., J.W. and C.D.; Supervision, C.D.; Funding acquisition, F.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Project of National Science Foundation of China [5197051430] and by the Key R&D Project of Hubei Province, China (2022BAA074).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hannan, M.A.; Azidin, F.A.; Mohamed, A. Hybrid electric vehicles and their challenges: A review. Renew. Sustain. Energy Rev. 2014, 29, 135–150. [Google Scholar] [CrossRef]
  2. Wang, H.; Huang, Y.; He, H.; Lv, C.; Liu, W.; Khajepour, A. Energy management of hybrid electric vehicles. Model. Dyn. Control. Electrified Veh. 2018, 2018, 159–206. [Google Scholar] [CrossRef]
  3. Tie, S.F.; Tan, C.W. A review of energy sources and energy management system in electric vehicles. Renew. Sustain. Energy Rev. 2013, 20, 82–102. [Google Scholar] [CrossRef]
  4. Li, J.; Zhou, Q.; Williams, H.; Xu, H. Back-to-Back Competitive Learning Mechanism for Fuzzy Logic Based Supervisory Control System of Hybrid Electric Vehicles. IEEE Trans. Ind. Electron. 2020, 67, 8900–8909. [Google Scholar] [CrossRef]
  5. Liu, J.; Chen, Y.; Zhan, J.; Shang, F. Heuristic Dynamic Programming Based Online Energy Management Strategy for Plug-In Hybrid Electric Vehicles. IEEE Trans. Veh. Technol. 2019, 68, 4479–4493. [Google Scholar] [CrossRef]
  6. Peng, J.; He, H.; Xiong, R. Rule based energy management strategy for a series–parallel plug-in hybrid electric bus optimized by dynamic programming. Appl. Energy 2017, 185, 1633–1643. [Google Scholar] [CrossRef]
  7. Liu, T.; Yu, H.; Guo, H.; Qin, Y.; Zou, Y. Online Energy Management for Multimode Plug-In Hybrid Electric Vehicles. IEEE Trans. Ind. Inform. 2019, 15, 4352–4361. [Google Scholar] [CrossRef]
  8. Panday, A.; Bansal, H.O. Energy management strategy for hybrid electric vehicles using genetic algorithm. J. Renew. Sustain. Energy 2016, 8, 15701. [Google Scholar] [CrossRef]
  9. Zhou, Q.; He, Y.; Zhao, D.; Li, J.; Li, Y.; Williams, H.; Xu, H. Modified Particle Swarm Optimization with Chaotic Attraction Strategy for Modular Design of Hybrid Powertrains. IEEE Trans. Transp. Electrif. 2021, 7, 616–625. [Google Scholar] [CrossRef]
  10. Zhou, Q.; Guo, S.; Xu, L.; Guo, X.; Williams, H.; Xu, H.; Yan, F. Global Optimization of the Hydraulic-Electromagnetic Energy-Harvesting Shock Absorber for Road Vehicles with Human-Knowledge-Integrated Particle Swarm Optimization Scheme. IEEE/ASME Trans. Mechatron. 2021, 26, 1225–1235. [Google Scholar] [CrossRef]
  11. Kim, N.; Cha, S.; Peng, H. Optimal Control of Hybrid Electric Vehicles Based on Pontryagin’s Minimum Principle. IEEE Trans. Control Syst. Technol. 2011, 19, 1279–1287. [Google Scholar] [CrossRef] [Green Version]
  12. Ren, Y.; Wu, Z. Research on the Energy Management Strategy of Hybrid Vehicle Based on Pontryagin’s Minimum Principle. In Proceedings of the 2018 10th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), Hangzhou, China, 25–26 August 2018; IEEE: New York, NY, USA, 2018; pp. 356–361. [Google Scholar]
  13. Wang, H.; Xie, Z.; Pu, L.; Ren, Z.; Zhang, Y.; Tan, Z. Energy management strategy of hybrid energy storage based on Pareto optimality. Appl. Energy 2022, 327, 120095. [Google Scholar] [CrossRef]
  14. Li, Y.; Chen, B. Development of integrated rule-based control and equivalent consumption minimization strategy for HEV energy management. In Proceedings of the 2016 12th IEEE/ASME International Conference on Mechatronic and Embedded Systems and Applications (MESA), Auckland, New Zealand, 29–31 August 2016; IEEE: New York, NY, USA, 2016; pp. 1–6. [Google Scholar]
  15. Guan, J.; Chen, B. Adaptive Power Management Strategy Based on Equivalent Fuel Consumption Minimization Strategy for a Mild Hybrid Electric Vehicle. In Proceedings of the 2019 IEEE Vehicle Power and Propulsion Conference (VPPC), Hanoi, Vietnam, 14–17 October 2019; IEEE: New York, NY, USA, 2019; pp. 1–4. [Google Scholar]
  16. Huang, Y.; Wang, H.; Khajepour, A.; He, H.; Ji, J. Model predictive control power management strategies for HEVs: A review. J. Power Sources 2017, 341, 91–106. [Google Scholar] [CrossRef]
  17. Wang, J.; Hou, X.; Du, C.; Xu, H.; Zhou, Q. A Moment-of-Inertia-Driven Engine Start-Up Method Based on Adaptive Model Predictive Control for Hybrid Electric Vehicles with Drivability Optimization. IEEE Access 2020, 8, 133063–133075. [Google Scholar] [CrossRef]
  18. Hu, X.; Wang, H.; Tang, X. Cyber-Physical Control for Energy-Saving Vehicle Following with Connectivity. IEEE Trans. Ind. Electron. 2017, 64, 8578–8587. [Google Scholar] [CrossRef]
  19. Zhou, Q.; Zhang, Y.; Li, Z.; Li, J.; Xu, H.; Olatunbosun, O. Cyber-Physical Energy-Saving Control for Hybrid Aircraft-Towing Tractor Based on Online Swarm Intelligent Programming. IEEE Trans. Ind. Inform. 2018, 14, 4149–4158. [Google Scholar] [CrossRef] [Green Version]
  20. Hu, X.; Liu, T.; Qi, X.; Barth, M. Reinforcement Learning for Hybrid and Plug-In Hybrid Electric Vehicle Energy Management: Recent Advances and Prospects. IEEE Ind. Electron. Mag. 2019, 13, 16–25. [Google Scholar] [CrossRef] [Green Version]
  21. Liu, T.; Zou, Y.; Liu, D.; Sun, F. Reinforcement Learning of Adaptive Energy Management with Transition Probability for a Hybrid Electric Tracked Vehicle. IEEE Trans. Ind. Electron. 2015, 62, 7837–7846. [Google Scholar] [CrossRef]
  22. Zou, Y.; Liu, T.; Liu, D.; Sun, F. Reinforcement learning-based real-time energy management for a hybrid tracked vehicle. Appl. Energy 2016, 171, 372–382. [Google Scholar] [CrossRef]
  23. Shuai, B.; Zhou, Q.; Li, J.; He, Y.; Li, Z.; Williams, H.; Xu, H.; Shuai, S. Heuristic action execution for energy efficient charge-sustaining control of connected hybrid vehicles with model-free double Q-learning. Appl. Energy 2020, 267, 114900. [Google Scholar] [CrossRef]
  24. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2016, arXiv:1509.02971v6. [Google Scholar]
  25. Wu, J.; Wei, Z.; Liu, K.; Quan, Z.; Li, Y. Battery-Involved Energy Management for Hybrid Electric Bus Based on Expert-Assistance Deep Deterministic Policy Gradient Algorithm. IEEE Trans. Veh. Technol. 2020, 69, 12786–12796. [Google Scholar] [CrossRef]
  26. Zhou, Q.; Li, J.; Shuai, B.; Williams, H.; He, Y.; Li, Z.; Xu, H.; Yan, F. Multi-step reinforcement learning for model-free predictive energy management of an electrified off-highway vehicle. Appl. Energy 2019, 255, 113755. [Google Scholar] [CrossRef]
  27. Lian, R.; Peng, J.; Wu, Y.; Tan, H.; Zhang, H. Rule-interposing deep reinforcement learning based energy management strategy for power-split hybrid electric vehicle. Energy 2020, 197, 117297. [Google Scholar] [CrossRef]
  28. Ganesh, A.H.; Xu, B. A review of reinforcement learning based energy management systems for electrified powertrains: Progress, challenge, and potential solution. Renew. Sustain. Energy Rev. 2022, 154, 111833. [Google Scholar] [CrossRef]
  29. Zhu, J.G.; Sun, Z.C.; Wei, X.Z.; Dai, H.F. A new lithium-ion battery internal temperature on-line estimate method based on electrochemical impedance spectroscopy measurement. J. Power Sources 2015, 274, 990–1004. [Google Scholar] [CrossRef]
  30. Liu, K.; Li, Y.; Hu, X.; Lucu, M.; Widanage, W.D. Gaussian Process Regression with Automatic Relevance Determination Kernel for Calendar Aging Prediction of Lithium-Ion Batteries. IEEE Trans. Ind. Inform. 2020, 16, 3767–3777. [Google Scholar] [CrossRef] [Green Version]
  31. Sun, Z.; Wang, C.; Zhou, Q.; Xu, H. Sensitivity Study of Battery Thermal Response to Cell Thermophysical Parameters (No. 2021-01-0751). In Proceedings of the SAE WCX Digital Summit, Virtual, 12–15 April 2021. SAE Technical Paper. [Google Scholar]
  32. Lin, X.; Perez, H.E.; Mohan, S.; Siegel, J.B.; Stefanopoulou, A.G.; Ding, Y.; Castanier, M.P. A lumped-parameter electro-thermal model for cylindrical batteries. J. Power Sources 2014, 257, 1–11. [Google Scholar] [CrossRef]
  33. Mura, R.; Utkin, V.; Onori, S. Energy Management Design in Hybrid Electric Vehicles: A Novel Optimality and Stability Framework. IEEE Trans. Control Syst. Technol. 2015, 23, 1307–1322. [Google Scholar] [CrossRef]
  34. Ebbesen, S.; Elbert, P.; Guzzella, L. Battery State-of-Health Perceptive Energy Management for Hybrid Electric Vehicles. IEEE Trans. Veh. Technol. 2012, 61, 2893–2900. [Google Scholar] [CrossRef]
  35. Sanghi, N. Deep Reinforcement Learning with Python: With PyTorch, TensorFlow and OpenAI Gym; Apress: Berkeley, CA, USA, 2021. [Google Scholar]
  36. Fujimoto, S.; van Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
  37. Wiewiora, E.; Cottrell, G.W.; Elkan, C. Principled Methods for Advising Reinforcement Learning Agents. In Proceedings of the 20th International Conference, Washington, DC, USA, 21–24 April 2003; pp. 792e–799e. [Google Scholar]
  38. Silver, D.; Singh, S.; Precup, D.; Sutton, R.S. Reward is enough. Artif. Intell. 2021, 299, 103535. [Google Scholar] [CrossRef]
  39. Wei, H.; Zhang, N.; Liang, J.; Ai, Q.; Zhao, W.; Huang, T.; Zhang, Y. Deep reinforcement learning based direct torque control strategy for distributed drive electric vehicles considering active safety and energy saving performance. Energy 2022, 238, 121725. [Google Scholar] [CrossRef]
  40. Zhou, J.; Xue, S.; Xue, Y.; Liao, Y.; Liu, J.; Zhao, W. A novel energy management strategy of hybrid electric vehicle via an improved TD3 deep reinforcement learning. Energy 2021, 224, 120118. [Google Scholar] [CrossRef]
Figure 1. Power flow diagram of parallel HEVs.
Figure 2. Architecture of TD3.
Figure 3. The architecture of the NPR-TD3 based strategy.
Figure 4. Driving cycles used for strategy training (a) and testing (b).
Figure 5. Average reward trajectories in the training process.
Figure 6. Fuel consumption trajectories in the training process.
Figure 7. Trajectories of fuel consumption (a) and battery SOC (b) using different strategies in training conditions.
Figure 8. Fluctuation of the fuel consumption (a) and battery SOC (b) of different strategies under different testing conditions.
Figure 9. (a) SOH degradation rate with respect to different strategies; (b) SOH degradation rate with respect to different ambient temperatures.
Table 1. Vehicle parameters.
Description               Value
Vehicle mass              1730 kg
Engine power              70 kW
Motor power               112 kW
Battery energy capacity   12 kWh
Frontal area              2.2 m²
Gear ratio                [2.725 1.5 1 0.71]
Final drive ratio         3.27
Wheel radius              0.275 m
Table 2. Pre-factor related to C-rate.
c_r      0.5       2         6         10
B(c_r)   31,630    21,681    12,934    15,512
Table 3. Comparative features of different strategies.
Strategy            State Space                              Reward Function
Proposed strategy   {V_spd, Trq_dmd, m_f, I_batt}            Non-parametric RF
Baseline I          {V_spd, Trq_dmd, m_f, SOC, SOH}          Non-parametric RF
Baseline II         {V_spd, Trq_dmd, m_f, I_batt}            Parametric RF
Table 4. Training results of different strategies.
Strategy      Convergence Episodes   Average Reward   Fuel (g)
Proposed      16                     −3691.17         525.43
Baseline I    38                     −3746.75         545.41
Baseline II   28                     −3760.86         574.74
Table 5. Comparison among various EMSs in the training condition.
EMS           Fuel (g)   Terminal SOC (%)
Proposed      525.43     63.03
Baseline I    545.41     63.21
Baseline II   574.74     63.90
Table 6. Comparison among various EMSs in the testing conditions.
EMS           Fuel (g)   Terminal SOC (%)
Proposed      2158.60    62.69
Baseline I    2263.28    70.30
Baseline II   2459.85    59.69
