1. Introduction
Hybrid electric vehicle (HEV) technology is regarded as a dependable vehicle technology that balances long driving range with low energy consumption. HEVs share the load of the internal combustion engine (ICE) with batteries and electric motors, allowing the ICE to operate more efficiently and thus improving fuel economy. The hybrid system must therefore balance multiple control objectives, such as fuel economy and charge sustaining, which makes the energy management strategy (EMS) key in determining nearly every aspect of an HEV's performance.
Research on HEVs has produced a large number of EMSs, which can be classified into three groups: rule-based EMSs, optimization-based EMSs, and learning-based EMSs [1]. Rule-based strategies control the operation of the power components in real time using well-developed heuristic or fuzzy rules [2,3]. These methods are distinguished by low computing cost and excellent reliability, resulting in widespread use in engineering practice [4]. However, because the formulation of control rules relies heavily on engineers' expertise and requires time-consuming calibration experiments, rule-based EMSs have a lengthy development cycle, limited fuel-saving performance, and poor self-adaptability [5]. Consequently, research on strategies that use optimization algorithms to achieve optimal control of HEVs has been substantially encouraged. Optimization-based strategies fall into two categories, global optimization and real-time optimization, depending on whether the optimization horizon is global or local [6]. Typical global optimization methods include genetic algorithms [7], particle swarm optimization [8], and dynamic programming (DP) [9], which find the global optimum of the objective function by offline computation. The DP approach, in particular, has been acknowledged as a benchmark for investigating optimal energy management for HEVs. However, global optimization techniques rely on prior knowledge of future driving conditions, making them impossible to use in real-world scenarios [10]. Therefore, real-time optimization methods have received much attention due to their practical viability. For instance, the equivalent consumption minimization strategy (ECMS), which accomplishes torque distribution by minimizing the sum of the actual fuel consumption and the equivalent fuel consumption of electrical energy at each time step, is a greedy method with proven engineering applications [11]. Unfortunately, the ECMS is highly sensitive to perturbations of the equivalence factor, which leads to poor adaptability and stability [12]. Despite the emergence of various adaptive ECMSs [13], the sensitivity issue of the equivalence factor has yet to be fully resolved. Another popular real-time optimal energy management approach is model predictive control (MPC), which combines optimization algorithms with prediction techniques to achieve rolling-horizon optimization of HEVs [14]. Since MPC determines the system's torque distribution by solving a prediction model, the accuracy of that model is crucial to its control performance [15]. However, the development of MPC-based EMSs is hampered by the difficulty of balancing accuracy and stability in current prediction methods.
To identify new control approaches that can address the disadvantages mentioned above, learning-based EMSs, particularly those based on deep reinforcement learning (DRL), have recently received much attention [16]. Since the energy management problem of HEVs can be described as a Markov decision process (MDP), DRL techniques can be successfully applied to the energy-efficient control of HEVs [17]. A DRL strategy uses neural networks to learn the torque distribution and improves energy efficiency by interacting with the environment. The resulting strategy is an end-to-end neural network mapping from state inputs to action outputs, which means that only a small amount of computation is needed to obtain the optimal output [18]. DRL EMSs fall into two categories, discrete control (DC) DRL strategies and continuous control (CC) DRL strategies, depending on the control form [19]. Deep Q-network (DQN)-based, double DQN-based, and dueling DQN-based methods are the most representative DC DRL approaches [20,21,22]. The core of these methods is to iterate a Q-value function $Q(s,a)$ that evaluates state-action pairs using neural networks trained with a temporal-difference (TD) approach, so that the action at each step is selected by $a = \arg\max_{a} Q(s,a)$ [23]. Because DC DRL methods can only handle a finite set of mutually independent actions, continuous physical quantities such as engine power must be discretized. However, higher discretization accuracy makes it difficult for the DC DRL strategy to converge, while lower discretization accuracy causes significant discretization error [24]. As a result, it is challenging to employ DC DRL methods to optimize the energy efficiency of HEVs, which encourages the development of CC DRL strategies that avoid the action discretization issue. The deep deterministic policy gradient (DDPG)-based strategy, the twin delayed DDPG (TD3)-based strategy, and the soft actor-critic (SAC)-based strategy are representative CC DRL EMSs that learn torque distribution using actor-critic (AC) technology [25,26,27]. Specifically, these strategies use actor networks to select output actions within the continuous action space and critic networks to learn value functions that evaluate the actor. Under the guidance of the critics, the actor is updated by gradient methods in the direction that obtains a greater value evaluation. With the AC structure, CC DRL strategies are more accurate than DC DRL strategies at determining the optimal action, since they can output any action within a continuous interval [28]. Numerous studies have demonstrated the effectiveness of CC DRL EMSs. Li et al. [29] found, for instance, that a DDPG-based strategy outperformed an MPC-based strategy in terms of fuel-saving performance and computational speed without knowledge of future trips. Zhou et al. [30] reported that, compared to DQN-, double DQN-, and dueling DQN-based strategies, a TD3-based strategy converges more quickly, is more adaptable, and uses less fuel. Wu et al. [31] showed that an SAC-based strategy has excellent learning ability: it reduced training time by 87.5% and improved fuel economy by up to 23.3% compared to a DQN-based strategy.
Nevertheless, the CC DRL strategies in current studies still have unsolved issues. The first is that the critic networks of common CC DRL strategies suffer from large Q-value estimation bias, which causes poor learning stability and extreme sensitivity to hyperparameters [32]. As a result, it usually takes a great deal of time to assess the effect of different hyperparameter combinations. Another issue is the lack of safety in CC DRL strategies. Because CC DRL strategies explore the whole action space aimlessly to avoid getting trapped in a local optimum, they may output actions that violate the physical constraints of the powertrain system. This can produce an unreasonable torque distribution between the engine and the motor/generator and cause irreversible damage to the power components. Besides, the initialization of CC DRL strategies is usually random, so the strategies require long exploration periods to accumulate experience, which leads to high training cost. Transfer reinforcement learning techniques are expected to solve this problem: knowledge from already learned strategies can be transferred to new strategies through transfer learning (TL), which speeds up learning by improving the initial performance of the new strategies. However, there is a lack of research on transfer reinforcement learning in the EMS field.
To cope with the above problems, this study adopts a cutting-edge CC DRL algorithm, softmax deep double deterministic policy gradients (SD3), to formulate an EMS for a power-split hybrid electric urban bus (HEUB). Building on the traditional AC method, SD3 not only avoids the overestimation and underestimation of the Q-value through the double Q-learning, Boltzmann softmax, and dual-actor techniques, but also improves sampling efficiency and learning stability. Therefore, the SD3 algorithm has great potential in the EMS field. The main contributions of this research are as follows:
The current EMS field lacks research related to the SD3 algorithm and, to the authors' knowledge, this is a pioneering work on an SD3-based strategy;
To prevent the SD3-based strategy from outputting torque assignments that violate the physical limits of the powertrain system during stochastic exploration, an action masking method that does not conflict with the principles of the algorithm is proposed. This work can drive DRL-based strategies toward engineering applications and inspire improvements to other DRL algorithms;
The possibility of utilizing TL methods to accelerate SD3 learning is investigated. That is, part of the prior knowledge of an already converged strategy is transferred to a new driving cycle to initialize the new strategy, avoiding a cold start in the new environment.
The paper is organized as follows: In Section 2, the modeling of the studied HEUB and its associated energy management problems are introduced. In Section 3, the system design of the SD3-based strategy, the invalid action masking method, and the transfer learning technique are described. In Section 4, the SD3-based strategy experiments and the strategy's performance are discussed numerically under several metrics. Conclusions and suggestions for future work are given in Section 5.
3. Methods and Design of SD3-Based EMS
The DRL problem can be described by an MDP defined as a 5-tuple $(S, A, T, r, \gamma)$ [37], where $S$ denotes the state space of the environment, $A$ denotes the action space of the agent, $T$ denotes the transition probability between states, $r$ denotes the reward function, and $\gamma$ denotes the discount factor. In the MDP, the DRL algorithm continuously updates the policy function to optimize the action output based on the reward feedback obtained from interaction with the environment. The policy function $\pi(s)$ is a mapping from states to actions, and its performance can be evaluated by a state-value function $V(s)$ or a Q-value function $Q(s,a)$. Therefore, DRL algorithms usually find the optimal policy function $\pi^{*}$ that maximizes the cumulative reward by iterating toward the optimal value function.
Since the energy management problem of HEVs is a sequential control problem, it can be mathematically described as an MDP. In the energy management MDP, the driving cycle and the powertrain system are part of the environment. The DRL algorithm searches for an optimal strategy that satisfies the energy optimization objective by trying different torque distribution strategies. This section proposes a novel DRL-based strategy, namely the SD3-based strategy, and presents its algorithmic principles and formulation details.
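To make the MDP formulation concrete, the following minimal sketch outlines how the energy management problem could be wrapped as a step-based environment. It is illustrative only: the class name, the surrogate powertrain update, and the reward weights are assumptions made for demonstration, not the models defined in Section 2.

```python
import numpy as np

class HEUBEnergyEnv:
    """Illustrative environment: state = (speed, acceleration, SOC), action = ICE power."""

    def __init__(self, cycle_speed, dt=1.0):
        self.v = np.asarray(cycle_speed, dtype=float)  # driving-cycle speed trace [m/s]
        self.dt = dt

    def reset(self):
        self.t, self.soc = 0, 0.6
        return self._obs()

    def step(self, p_ice):
        # Surrogate powertrain update; the real fuel and SOC dynamics come from Section 2.
        fuel = 0.05 * max(p_ice, 0.0) * self.dt
        self.soc = float(np.clip(self.soc - 0.0005 * (30.0 - p_ice) * self.dt, 0.0, 1.0))
        self.t += 1
        reward = -(fuel + 100.0 * (0.6 - self.soc) ** 2)   # assumed multi-objective weighting
        done = self.t >= len(self.v) - 1
        return self._obs(), reward, done

    def _obs(self):
        acc = (self.v[min(self.t + 1, len(self.v) - 1)] - self.v[self.t]) / self.dt
        return np.array([self.v[self.t], acc, self.soc], dtype=np.float32)
```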
3.1. Preliminary Formulation of SD3-Based EMS
3.1.1. Brief Review of SD3
SD3 is an off-policy CC DRL algorithm that iterates the optimal Q-value function and policy function by the AC method [32]. Specifically, SD3 uses neural networks called the actor and the critic to approximate the policy function $\pi(s)$ and the Q-value function $Q(s,a)$, respectively. The actor network is the interactive end of SD3 and is responsible for action selection, while the critic network estimates the Q-value and thus guides the actor network, via the gradient method, toward the strategy that maximizes the Q-value. With this design, SD3 can handle problems with a continuous action space and avoid discretization errors. On this basis, to improve the critic's estimation performance and the actor's sampling efficiency, SD3 draws on the experience of DQN and double Q-learning and integrates the dual-actor, dual-critic, and target network techniques into the AC method [23,38]. The learning framework of SD3 therefore comprises eight neural networks: two actor networks, two critic networks, two target actor networks, and two target critic networks. Among them, only the actor and critic networks are trained; the target networks are updated by softly copying the weights of the actor and critic networks. It is worth mentioning that the actor and critic networks are trained using an experience replay method. That is, SD3 deposits the transition tuple $(s, a, r, s', d)$, consisting of the state $s$, action $a$, immediate reward $r$, next state $s'$, and done flag $d$ generated from each interaction with the environment, into the experience buffer $\mathcal{B}$, and then randomly samples mini-batches from the buffer to train the networks. This practice improves the utilization of samples and attenuates the correlation between training samples.
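As a minimal illustration of the experience replay mechanism described above, the following Python sketch stores transition tuples and draws uniform mini-batches; the class name and capacity are assumptions for illustration, not details taken from the original implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s', d) transitions and samples uniform mini-batches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded first

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling attenuates the temporal correlation between samples.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(list, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```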
As for the critic networks, SD3 uses the clipped double Q-learning method integrated with the Boltzmann softmax operator to estimate the Q-value and construct the TD error. Namely, the loss function of the critic networks is as follows:

$$L(\theta_i) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{B}} \left[ \left( Q_{\theta_i}(s,a) - y \right)^2 \right], \quad i = 1, 2 \tag{6}$$

where

$$y = r + \gamma (1 - d)\, \mathrm{softmax}_{\beta}\big(\hat{Q}\big)(s') \tag{7}$$

where

$$\mathrm{softmax}_{\beta}\big(\hat{Q}\big)(s') = \frac{\mathbb{E}_{a' \sim p}\!\left[ \dfrac{\exp\big(\beta \hat{Q}(s', a')\big)}{p(a')}\, \hat{Q}(s', a') \right]}{\mathbb{E}_{a' \sim p}\!\left[ \dfrac{\exp\big(\beta \hat{Q}(s', a')\big)}{p(a')} \right]} \tag{8}$$

where

$$\hat{Q}(s', a') = \min\big( Q_{\theta'_1}(s', a'),\, Q_{\theta'_2}(s', a') \big), \quad a' = \pi_{\phi'_i}(s') + \epsilon, \quad \epsilon \sim \mathrm{clip}\big(\mathcal{N}(0, \tilde{\sigma}), -c, c\big) \tag{9}$$

where $a'$ denotes the next action; $\pi_{\phi'_1}$ and $\pi_{\phi'_2}$ denote the target actor networks with weights $\phi'_1$ and $\phi'_2$, respectively; $\epsilon$ denotes the noise obeying a normal distribution, which is clipped to the interval $[-c, c]$; $Q_{\theta'_1}$ and $Q_{\theta'_2}$ denote the target critic networks with weights $\theta'_1$ and $\theta'_2$, respectively; $p$ denotes the probability density function of the normal distribution; $\beta$ denotes the parameter of the softmax operator; and $Q_{\theta_1}$ and $Q_{\theta_2}$ denote the critic networks with weights $\theta_1$ and $\theta_2$. Note that Equation (9) uses the minimum of the two target critic networks to initially estimate the Q-value. This technique, called clipped double Q-learning, can mitigate the overestimation of values. Equation (8) yields an unbiased estimate of the softmax operator by importance sampling, and the estimated Q-value is further processed. This technique can smooth the optimization landscape and limit the estimation error to a certain range.
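The following PyTorch-style sketch shows one way Equations (7)-(9) could be evaluated in practice, approximating the expectations in the softmax operator with K sampled noise actions; the tensor shapes, variable names, and sample count are assumptions made for illustration rather than the authors' implementation.

```python
import torch

def softmax_target(next_state, reward, done, target_actor, target_critic1, target_critic2,
                   gamma=0.99, beta=0.05, noise_std=0.2, noise_clip=0.5, num_noise=50):
    """Compute the critic target y of Eq. (7) using the softmax operator of Eq. (8)."""
    batch_size = next_state.shape[0]
    with torch.no_grad():
        mu = target_actor(next_state)                            # deterministic next action, (B, act_dim)
        # Sample K noisy candidate actions around mu for importance sampling.
        noise = torch.randn(num_noise, batch_size, mu.shape[-1]) * noise_std
        clipped_noise = noise.clamp(-noise_clip, noise_clip)
        next_actions = mu.unsqueeze(0) + clipped_noise           # (K, B, act_dim)

        # Clipped double Q-learning: element-wise minimum of the two target critics (Eq. (9)).
        s_rep = next_state.unsqueeze(0).expand(num_noise, -1, -1)
        q1 = target_critic1(s_rep, next_actions)
        q2 = target_critic2(s_rep, next_actions)
        q_min = torch.min(q1, q2).squeeze(-1)                    # (K, B)

        # Importance-sampling estimate of the softmax operator (Eq. (8));
        # p(a') is taken as the Gaussian density of the sampling noise.
        normal = torch.distributions.Normal(0.0, noise_std)
        log_p = normal.log_prob(clipped_noise).sum(dim=-1)       # (K, B)
        weights = torch.softmax(beta * q_min - log_p, dim=0)     # exp(beta*Q)/p, normalized over K
        softmax_q = (weights * q_min).sum(dim=0)                 # (B,)

        y = reward + gamma * (1.0 - done) * softmax_q            # Bellman target of Eq. (7)
    return y
```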
As for the two actor networks, their learning objective is to maximize the Q-value, so they are updated according to the output value of the critic network using the deterministic policy gradient method, i.e.,
where
denotes the actor networks
and
with weights
and
, respectively.
The target networks do not need to be trained; their parameters are updated by soft updates, i.e.,

$$\theta'_i \leftarrow \tau \theta_i + (1 - \tau) \theta'_i, \qquad \phi'_i \leftarrow \tau \phi_i + (1 - \tau) \phi'_i, \quad i = 1, 2 \tag{11}$$

where $\tau$ is the soft update factor. Usually, a small $\tau$ is set to ensure that the target networks change smoothly so that stable learning targets can be provided for the critic networks. Based on the above description, the pseudo-code of the full SD3 algorithm is provided in Algorithm 1.
Algorithm 1: SD3
Initialize critic networks $Q_{\theta_1}$, $Q_{\theta_2}$ and actor networks $\pi_{\phi_1}$, $\pi_{\phi_2}$ with random parameters $\theta_1$, $\theta_2$, $\phi_1$, $\phi_2$
Initialize target networks $\theta'_1 \leftarrow \theta_1$, $\theta'_2 \leftarrow \theta_2$, $\phi'_1 \leftarrow \phi_1$, and $\phi'_2 \leftarrow \phi_2$
Initialize replay buffer $\mathcal{B}$
for episode = 1 to E do
  for t = 1 to T do
    Observe state $s$ and select action $a$ with exploration noise according to the dual-actor scheme (each actor proposes $a_i = \pi_{\phi_i}(s) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma)$, and the proposal with the larger critic value is chosen)
    Execute action $a$, observe reward $r$, next state $s'$, and done flag $d$
    Store transition tuple $(s, a, r, s', d)$ in $\mathcal{B}$
    for i = 1, 2 do
      Randomly sample a mini-batch of N transitions $(s, a, r, s', d)$ from $\mathcal{B}$
      $a' \leftarrow \pi_{\phi'_i}(s') + \epsilon$, $\epsilon \sim \mathrm{clip}(\mathcal{N}(0, \tilde{\sigma}), -c, c)$
      $\hat{Q}(s', a') \leftarrow \min\big(Q_{\theta'_1}(s', a'), Q_{\theta'_2}(s', a')\big)$
      Estimate $\mathrm{softmax}_{\beta}(\hat{Q})(s')$ by importance sampling according to Equation (8)
      $y \leftarrow r + \gamma (1 - d)\, \mathrm{softmax}_{\beta}(\hat{Q})(s')$
      Update the critic $\theta_i$ according to the Bellman loss: $L(\theta_i) = N^{-1} \sum \big(Q_{\theta_i}(s,a) - y\big)^2$
      Update the actor $\phi_i$ by the policy gradient: $\nabla_{\phi_i} J(\phi_i) = N^{-1} \sum \nabla_a Q_{\theta_i}(s,a)\big|_{a = \pi_{\phi_i}(s)} \nabla_{\phi_i} \pi_{\phi_i}(s)$
      Update the target networks: $\theta'_i \leftarrow \tau \theta_i + (1 - \tau)\theta'_i$, $\phi'_i \leftarrow \tau \phi_i + (1 - \tau)\phi'_i$
    end for
  end for
end for
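As a compact PyTorch illustration of the per-iteration updates in Algorithm 1, the sketch below performs the actor update of Equation (10) and the soft target update of Equation (11); the optimizer setup and variable names are illustrative assumptions rather than the authors' implementation.

```python
import torch

def update_actor_and_targets(states, actor, critic, actor_target, critic_target,
                             actor_optimizer, tau=0.005):
    """Deterministic policy gradient step (Eq. (10)) and soft target update (Eq. (11))."""
    # Actor objective: maximize Q(s, pi(s)), i.e., minimize its negative mean.
    actor_loss = -critic(states, actor(states)).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()

    # Soft (Polyak) update of the corresponding target networks.
    with torch.no_grad():
        for p, p_targ in zip(critic.parameters(), critic_target.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
        for p, p_targ in zip(actor.parameters(), actor_target.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
```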
3.1.2. Reward, Observation, Action, and Parameter Settings
The closed-loop control framework of the SD3-based strategy is shown in Figure 4, and its deployment consists of three main parts: reward design, environment construction (including state design and action design), and parameter setting of the SD3 algorithm.
Reward: In formulating the SD3-based strategy, the SD3 algorithm searches for the energy management strategy that minimizes Equation (5) by controlling the HEUB's interaction with the environment. Therefore, the multi-objective reward function of the SD3-based strategy is designed as a weighted combination of a fuel consumption penalty and an SOC-maintenance penalty. The main challenge of this multi-objective reward function is the tuning of the weights assigned to the two terms: the purpose of the tuning is to minimize fuel use while maintaining the SOC, and different weights will result in a variety of learning outcomes.
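As a hedged illustration of such a multi-objective reward, the sketch below penalizes instantaneous fuel consumption and the squared deviation of the SOC from a reference value; the weight values, the reference SOC, and the function name are assumptions chosen for demonstration, not the paper's calibrated settings.

```python
def reward(fuel_rate_g_per_s, soc, soc_ref=0.6, w_fuel=1.0, w_soc=350.0):
    """Illustrative multi-objective reward: trade off fuel use against SOC maintenance.

    Larger w_soc enforces charge sustaining more strictly at the expense of fuel economy;
    larger w_fuel does the opposite. The weights here are placeholders, not tuned values.
    """
    return -(w_fuel * fuel_rate_g_per_s + w_soc * (soc_ref - soc) ** 2)
```

With such a formulation, maximizing the cumulative reward is equivalent to minimizing the weighted sum of fuel consumption and SOC deviation over the driving cycle.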
Observation: The SD3-based strategy's input, the state observation, is used to present fundamental information about the environment. The observed quantities should be comprehensive and independent of one another. Therefore, vehicle speed, acceleration, and SOC are set as the state observations in this study, i.e., $s = [v, acc, SOC]$.
Actions: After observing the state, the SD3-based strategy uses the actor networks to choose a suitable action that decides the system's torque distribution. According to the optimal BSFC curve discussed in the previous section, the ICE power determines the ICE speed and torque with the best efficiency, so that the demand speed and demand torque of the motor can be further determined from the mechanical characteristics of the system. Thus, the SD3-based strategy can optimize energy control by controlling the power output of the ICE, and the action space is defined as $a = P_{ICE}$.
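The following sketch illustrates how a scalar ICE power action could be mapped through an optimal-BSFC operating line to an engine operating point; the lookup-table values and the interpolation scheme are assumptions for illustration only, since the actual engine map and powertrain equations are given in Section 2.

```python
import numpy as np

# Hypothetical optimal-BSFC operating line: for each ICE power, the most
# efficient (speed, torque) pair. Real values would come from the engine map.
P_GRID = np.array([0.0, 20.0, 40.0, 60.0, 80.0, 100.0])          # ICE power, kW
W_OPT  = np.array([80.0, 120.0, 160.0, 200.0, 230.0, 260.0])     # ICE speed, rad/s
T_OPT  = np.array([0.0, 167.0, 250.0, 300.0, 348.0, 385.0])      # ICE torque, Nm

def ice_operating_point(p_ice_kw):
    """Interpolate the best-efficiency ICE speed and torque for a power command.

    The motor/generator speed and torque demands would then follow from the
    planetary-gear relations of the power-split system (Section 2).
    """
    w_ice = np.interp(p_ice_kw, P_GRID, W_OPT)
    t_ice = np.interp(p_ice_kw, P_GRID, T_OPT)
    return w_ice, t_ice
```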
Parameters: In this study, all the networks of SD3 consist of fully connected layers, including three hidden layers with 512, 256, and 256 nodes, respectively. The hidden layers of the critic networks are activated by the ReLU function. The ReLU function also activates the input and intermediate layers of the actor networks, while the tanh function activates their output layers. Since the tanh function maps the output of the actor networks to (-1, 1), a linear transformation to the action space is required. The essential hyperparameters, which were chosen after several experiments, are listed in Table 2. Among them, the discount factor $\gamma$, set to 0.99, means that long-term payoff is taken into account, thereby increasing the likelihood that SD3 will learn the globally optimal policy [39]. The lower learning rate of the actor networks compared with that of the critic networks makes the iteration of the Q-function relatively faster, stabilizing the policy update. A larger buffer capacity accumulates more experience to prevent policy overfitting, and an appropriate mini-batch size benefits the learning stability and efficiency of the algorithm.
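A minimal PyTorch sketch of the actor and critic architectures described above is given below; the layer sizes follow the text (512, 256, and 256 hidden nodes), while the class names and the action-rescaling bounds are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a state to a deterministic action in (-1, 1), later rescaled to ICE power."""

    def __init__(self, state_dim=3, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),   # output in (-1, 1)
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Estimates Q(s, a) for a state-action pair."""

    def __init__(self, state_dim=3, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def to_ice_power(raw_action, p_min=0.0, p_max=100.0):
    """Linearly rescale the tanh output from (-1, 1) to an assumed ICE power range [kW]."""
    return (raw_action + 1.0) * 0.5 * (p_max - p_min) + p_min
```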
3.2. Tips for Improving SD3-Based Strategy
3.2.1. Action Masking Technology
In the studied powertrain, MG2's role is to share the ICE load or to charge the battery through braking energy recovery, so it must have a wide physical operating range to ensure a good match with the ICE. In contrast, MG1's role is to regulate the ICE speed or to absorb part of the ICE power to generate electricity, so its physical operating range can be relatively narrow. Given this, the studied HEUB is equipped with a high-power motor MG2 and a low-power motor MG1 for the ICE. Nevertheless, Equation (2) states that when the ICE's demand speed or torque is too high, the demand speed or torque of MG1 is also likely to be too high, which would cause MG1 to violate Equation (5). Namely, the ICE should only operate within a specific feasible range at each time step $t$ to guarantee the reasonable operation of MG1. This indicates that the output action of SD3 should fall on the part of the BSFC curve that intersects with this feasible region. Meanwhile, because the speed and torque on the BSFC curve are positively correlated, the higher the power at each point, the higher the speed and torque. Therefore, the feasible power range is continuous, as illustrated in Figure 5, with the upper limit given by the maximum feasible power $P_{max,t}$ and the lower limit given by the minimum feasible power $P_{min,t}$. In conclusion, the output action of SD3 should be limited between $P_{min,t}$ and $P_{max,t}$.
Unfortunately, like other DRL-based EMSs, SD3 explores the entire action space aimlessly during learning to reduce the risk of falling into a local optimum. That means SD3 will output actions outside of $[P_{min,t}, P_{max,t}]$. Therefore, it is necessary to design action masking (AM) to filter invalid actions and prevent SD3 from unnecessary exploration during learning. Given that the essence of SD3 is to develop strategies with long-term planning through the state distribution, action distribution, and state transitions of the learning environment, AM needs to follow two criteria so as not to violate the principles of the SD3 algorithm and thus degrade its learning performance. The first is that AM must not change the original action space distribution, so that the environment's underlying state transition probability function is not destroyed. The second is that samples containing invalid actions must not be collected into the experience replay buffer, so that erroneous samples do not participate in training. Specifically, SD3 implements AM by repeating the following steps at each time step t.
At each time step t, the feasible range $[P_{min,t}, P_{max,t}]$ is first calculated in three steps: (1) the action space is discretized to obtain a set of candidate ICE powers; (2) the feasibility of each candidate is checked by traversing the set according to the dynamics of the powertrain system (similar to ECMS), yielding $P_{max,t}$ and $P_{min,t}$; and (3) the feasible range $[P_{min,t}, P_{max,t}]$ is obtained;
Then, the ICE power output by the actor network in the SD3-based strategy is restricted to $[P_{min,t}, P_{max,t}]$ by the clip operation, i.e., $P_{ICE} = \mathrm{clip}(P_{ICE}, P_{min,t}, P_{max,t})$, since the clip operation does not change the original action space and thus does not affect the policy.
It is essential to point out that the first step is widely used by many mathematical model-based approaches, such as DP, ECMS, and MPC, which all exclude invalid actions by traversing the action space. Also, the AM technique needs to be deployed for both the actor networks and the target actor networks; otherwise, the learning ability of the algorithm will be seriously affected. Besides, masking invalid actions by the clip operation is only applicable to algorithms such as DDPG, TD3, and SD3, which are based on the AC framework and a deterministic policy. For Q-value-based algorithms such as DQN, invalid actions can be excluded from the argmax operation by setting their Q-values to negative infinity. For algorithms such as proximal policy optimization and SAC, which are based on the AC framework and a stochastic policy, action masking can be achieved by setting the sampling probability of invalid actions to zero.
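The following sketch condenses the two AM steps above into code; the feasibility check `mg1_feasible`, the discretization step size, and the power bounds are hypothetical placeholders, since the real constraint evaluation follows the powertrain equations of Section 2.

```python
import numpy as np

def mg1_feasible(p_ice, demand_speed, demand_torque):
    """Placeholder feasibility check: the real condition follows from the
    planetary-gear relations and the MG1 speed/torque limits (Section 2)."""
    return True  # substitute the actual constraint evaluation here

def feasible_power_range(demand_speed, demand_torque, p_lo=0.0, p_hi=100.0, step=1.0):
    """Step 1: traverse a discretized action space and keep the ICE powers whose
    resulting MG1 operating point stays within its physical limits."""
    candidates = np.arange(p_lo, p_hi + step, step)
    valid = [p for p in candidates if mg1_feasible(p, demand_speed, demand_torque)]
    # The feasible set is continuous along the BSFC curve, so its bounds suffice.
    return min(valid), max(valid)

def mask_action(p_ice, p_min_t, p_max_t):
    """Step 2: clip the actor output into the feasible range without altering the
    underlying action distribution; applied to both actors and target actors."""
    return float(np.clip(p_ice, p_min_t, p_max_t))
```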
3.2.2. Transfer Learning Technology
The DRL algorithm suffers from sample inefficiency: it must interact with the environment to acquire many samples in order to learn a strategy. However, once the environment changes, the previous learning results become invalid and the neural networks must be retrained at great cost [40]. Transfer learning is an effective means of solving this problem, and its core idea is to use the experience gained by a model on an old task to improve its learning performance on a related but different task [41].
For HEUBs deployed with the same SD3-based strategy, the rewards, states, actions, powertrain characteristics, and algorithm settings are identical across different urban driving conditions. This implies a correlation between the optimal EMSs for different driving conditions. Therefore, after learning on the source driving cycle $D_S$, some parameters of the actor networks and critic networks can be transferred to the target driving cycle $D_T$ to accelerate learning. Specifically, the knowledge transfer of the SD3-based strategy from $D_S$ to $D_T$ is divided into three steps (a minimal sketch follows the steps):
The first step is to extract the parameters of the actor networks and critic networks of the SD3-based strategy that has sufficiently converged on $D_S$;
Then, the extracted parameters are used to initialize the parameters of the corresponding networks of the SD3-based strategy on $D_T$, and the input and intermediate layers of the networks are frozen;
Finally, the output layers of the networks are randomly initialized, and the networks are fine-tuned with a small amount of training.
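A hedged PyTorch sketch of these three transfer steps is given below; it assumes actor/critic classes like those sketched earlier and treats the final linear layer as the "output layer", which is an assumption about where the freezing boundary would be drawn.

```python
import torch
import torch.nn as nn

def transfer_network(source_net, target_net):
    """Initialize a target-cycle network from a converged source-cycle network,
    freeze the input and intermediate layers, and re-initialize the output layer."""
    # Steps 1-2: copy the converged source parameters into the new network.
    target_net.load_state_dict(source_net.state_dict())

    # Identify the final Linear layer as the (assumed) output layer.
    linear_layers = [m for m in target_net.modules() if isinstance(m, nn.Linear)]
    output_layer = linear_layers[-1]

    # Step 2: freeze all layers (input and intermediate layers stay frozen).
    for p in target_net.parameters():
        p.requires_grad = False

    # Step 3: randomly re-initialize and unfreeze only the output layer for fine-tuning.
    nn.init.xavier_uniform_(output_layer.weight)
    nn.init.zeros_(output_layer.bias)
    for p in output_layer.parameters():
        p.requires_grad = True
    return target_net

# Example: fine-tune only the trainable (output-layer) parameters on the target cycle.
# optimizer = torch.optim.Adam([p for p in target_net.parameters() if p.requires_grad], lr=1e-4)
```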