1. Introduction
With the depletion of oil resources and increasing climate variability, energy has attracted the attention of a wide range of industries worldwide. To address climate risk and promote the use of clean energy, the goals of peaking carbon emissions and achieving carbon neutrality have been widely embraced across sectors [1]. Compared to other power sources, lithium-ion batteries offer advantages such as high energy density, low self-discharge, and no memory effect [2]. Therefore, lithium-ion batteries are widely used as the main energy source in electric vehicles [3,4,5,6]. However, electric vehicles relying solely on lithium-ion batteries struggle to handle high-rate currents and fluctuating driving conditions, leading to faster battery aging and a reduced lifespan [7,8]. Ultracapacitors, on the other hand, offer strong instantaneous power output. Combining the two components into a hybrid energy storage system (HESS) and implementing an efficient energy management strategy (EMS) can therefore reduce the adverse effects of high-rate currents on lithium-ion batteries, improving battery performance and enhancing the reliability of battery pack operation [9,10,11]. The EMS is thus essential for efficient power distribution. This paper investigates the impact of two deep reinforcement learning (DRL) algorithms, the deep Q-network (DQN) and the deep deterministic policy gradient (DDPG), on energy efficiency, battery lifespan, and power distribution under identical training conditions.
EMSs fall into two primary types: rule-based and optimization-based approaches. Rule-based EMSs are widely used in electric vehicles owing to their low computational complexity and high reliability [12,13,14]. However, due to the complex dynamic characteristics of HESSs, rule-based EMSs are difficult to adjust online to real driving conditions, which degrades the control performance of the HESS; they also rely heavily on the experience of engineering designers. Optimization-based EMSs can be further divided into global optimization methods and real-time optimization approaches. Dynamic programming (DP) is a well-known global optimization strategy with excellent control performance [15]. In contrast to rule-based EMSs, optimization-based EMSs achieve lower energy consumption for the HESS. However, these methods involve significant computational costs and may suffer from the curse of dimensionality and discretization errors, making them unsuitable for practical applications; they are therefore often used as reference points to evaluate the performance of other EMSs. To realize optimal power allocation in HESSs, Ref. [16] utilized the outcomes of DP to enhance an adaptive rule-based EMS; the simulation results showed that this strategy protects the battery effectively and reduces overall vehicle energy loss under unknown driving conditions. Ref. [17] combined wavelet transforms, neural networks, and fuzzy logic: neural network models are trained offline on data sets obtained from wavelet decomposition and then used to predict the low-frequency power demand of the battery, achieving real-time, efficient power allocation for the HESS. Real-time optimization-based EMSs are generally divided into the equivalent consumption minimization strategy (ECMS) [18], model predictive control (MPC) [19], and the adaptive equivalent consumption minimization strategy (A-ECMS) [20]. In Ref. [21], a combination of MPC and Pontryagin’s minimum principle is presented to carry out energy management for the HESS. Ref. [22] proposed an online ECMS-based EMS for electric vehicles; compared to existing ECMS methods, the strategy reduces fuel consumption by 8–14%. Although these real-time optimization strategies reduce the computational complexity to a certain extent, they are prone to getting stuck in local optima during the solving process, which restricts the full potential of vehicle performance.
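To make the global optimization baseline concrete, the sketch below shows how a DP-based power split for a battery/ultracapacitor HESS can be computed by backward induction over a discretized ultracapacitor SOC grid. It is a minimal illustration only: the demand profile, loss model, and all parameter values are placeholders, not the vehicle models or driving-cycle data used in this paper.

```python
import numpy as np

# Illustrative power demand profile (kW per step) and parameters -- placeholders.
P_dem = np.array([12.0, 30.0, -8.0, 45.0, 5.0, -20.0, 25.0])
dt = 1.0                                  # time step (s)
E_uc = 100.0                              # usable ultracapacitor energy (kJ)
soc_grid = np.linspace(0.2, 0.9, 71)      # discretized ultracapacitor SOC states
u_grid = np.linspace(-30.0, 30.0, 61)     # candidate ultracapacitor power (kW)

def step_loss(p_batt, p_uc):
    """Toy quadratic resistive-loss model standing in for the Section 2 models."""
    return 0.02 * p_batt ** 2 + 0.005 * p_uc ** 2

T = len(P_dem)
V = np.zeros((T + 1, len(soc_grid)))      # cost-to-go, terminal cost = 0
policy = np.zeros((T, len(soc_grid)))     # optimal UC power per (time, state)

for t in range(T - 1, -1, -1):            # backward induction over time
    for i, soc in enumerate(soc_grid):
        best = np.inf
        for p_uc in u_grid:
            p_batt = P_dem[t] - p_uc      # battery supplies the remainder
            soc_next = soc - p_uc * dt / E_uc
            if not (soc_grid[0] <= soc_next <= soc_grid[-1]):
                continue                  # prune SOC-infeasible transitions
            cost = step_loss(p_batt, p_uc) + np.interp(soc_next, soc_grid, V[t + 1])
            if cost < best:
                best, policy[t, i] = cost, p_uc
        V[t, i] = best

print(policy[0])                          # optimal first-step UC power per state
```

Because the entire demand profile must be known in advance and the nested loops grow with the state and action grid resolutions, DP of this kind is practical only offline, which is why it serves purely as a benchmark in this paper.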
With the rapid development of internet technology and artificial intelligence (AI) algorithms, reinforcement learning (RL) algorithms have demonstrated remarkable decision-making capabilities in real engineering applications. These AI algorithms can obtain optimal EMSs for HESSs with unknown system structures and parameters. RL algorithms can generally be divided into traditional RL and deep reinforcement learning (DRL). As a typical representative of traditional RL, Q-Learning has been employed extensively across industries, and Q-Learning-based EMSs for HESSs are analyzed comprehensively in Refs. [23,24]. However, the training process of Q-Learning is highly unstable owing to the necessity of discretizing the state and action spaces. To ensure the stability of the energy allocation process, Ref. [25] proposed a two-stage EMS based on Q-Learning; compared with recent EMSs, the training time and average absolute error were reduced by 23% and 20%, respectively. To better adapt to actual driving conditions, an EMS based on the DQN algorithm for HESSs was proposed in Ref. [26], reducing battery capacity degradation by 26.36% while guaranteeing favorable fuel economy for electric vehicles. Ref. [27] employed a hierarchical deep Q-Learning (DQL-H) algorithm to determine the best solution for the EMS; this hierarchical algorithm addresses the challenge of limited feedback during training while also enhancing training efficiency and reducing fuel consumption. A remaining problem is that the DQN overestimates Q values during training. A novel Double DQN-based EMS is introduced in Ref. [28], which achieves cost savings by converting discrete state parameters into continuous ones; the simulation results showed that the policy could further decrease costs by 5.5% and reduce training time by 93.8%. Another Double DQN-based EMS framework for HESSs was constructed to address the shortcomings of traditional control strategies and RL [29], and the experimental results indicated that the proposed strategy significantly improves vehicle fuel economy. However, highly discretized state-action spaces not only inflate the dimensionality of the control algorithm but also increase convergence difficulties. To address this challenge, DRL algorithms with an Actor–Critic structure have been widely employed to handle high-dimensional continuous state-action spaces. In Ref. [30], the DDPG method was combined with transfer learning to optimize the EMS of a HESS; the simulation results illustrated superior early performance and quicker convergence, with strong robustness and adaptability. A hierarchical EMS based on DDPG is proposed in Ref. [31]: in the upper-level strategy, the DDPG algorithm employs historical operating condition information to generate the State of Charge (SOC) for future driving segments, while the lower-level strategy uses a long short-term memory (LSTM) neural network to forecast the vehicle’s short-term speed. The analysis revealed that the method improves the overall vehicle fuel economy.
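The structural difference between the two algorithm families discussed above can be summarized in a few lines. The sketch below, written in PyTorch with illustrative state and action dimensions (the actual network architectures of Section 3 may differ), shows why a DQN requires a discretized action set while DDPG’s Actor–Critic structure outputs a continuous power-split action directly.

```python
import torch
import torch.nn as nn

STATE_DIM = 4    # e.g., [battery SOC, UC SOC, demanded power, speed] -- illustrative
N_ACTIONS = 11   # DQN: discretized split ratios {0.0, 0.1, ..., 1.0} -- illustrative

# DQN maps a state to one Q-value per *discrete* action; the greedy action
# is the argmax over that fixed, finite set.
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

# The DDPG actor maps a state directly to a *continuous* action in [0, 1],
# read here as the fraction of demanded power assigned to the ultracapacitor.
actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

# The DDPG critic scores a (state, action) pair, so the actor can be updated
# by following the critic's gradient with respect to the action.
critic = nn.Sequential(nn.Linear(STATE_DIM + 1, 64), nn.ReLU(), nn.Linear(64, 1))

state = torch.rand(1, STATE_DIM)
dqn_action = q_net(state).argmax(dim=1)                    # index into the discrete set
ddpg_action = actor(state)                                 # continuous split ratio
q_value = critic(torch.cat([state, ddpg_action], dim=1))   # Q(s, a)
```

Because the actor outputs a continuous split ratio, DDPG avoids the action-space discretization that inflates the DQN’s output dimensionality and hampers convergence.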
A comprehensive literature review indicates that DRL-based EMSs have gained significant attention in recent research. However, notable challenges still impede further advances in this domain. It is widely accepted that advanced DRL algorithms can improve the performance of EMSs; nevertheless, the lack of standardized benchmarks for comparing EMSs based on different DRL algorithms is a major obstacle. Many studies only validate that DRL algorithms with specific parameter settings outperform traditional RL algorithms under particular driving conditions. This limitation complicates the assessment and comparison of DRL-based EMSs and hinders the development of the research field. Future research should establish standardized benchmarks to facilitate the comparison and evaluation of diverse EMSs, laying the groundwork for further progress in DRL-based energy management.
Therefore, to meet these challenges, the total energy loss is taken as the optimization control target for the HESS. The differences in EMS performance under different DRL methods are explained by analyzing two DRL algorithms and their principal frameworks. The key contributions of this paper are outlined below: (1) Two DRL algorithms and their schematics are presented, and systematic comparative experiments on EMSs for HESSs are conducted. (2) By comparing different DRL-based EMSs for electric vehicles under the same benchmark, this paper highlights future directions for improving DRL-based EMSs.
The structure of this paper is as follows. Section 2 establishes the power system models of the HESS. Section 3 presents two DRL-based EMSs. Simulation experiment results of the different strategies are analyzed and discussed in Section 4. Section 5 summarizes the conclusions of this research.
4. Simulation Results and Discussion
To visually compare the impact of different DRL-based EMSs in electric vehicles, the rule-based, DP-based, DQN-based, and DDPG-based EMSs are selected for comparison. The DP-based strategy, with its outstanding global optimization characteristics, is used to evaluate the performance of the other EMSs. The comparison results for the SOC, current, and power under the different EMSs are shown in Figure 9. In addition, to provide a better comparison of the effects of the different EMSs, Table 7 summarizes some key features of the HESS over the whole control process.
Figure 9a,b show the change curves of the battery SOC and ultracapacitor SOC, respectively. Because of the low capacity of the ultracapacitor, it is mainly used to supply peak power demand; the economy of each EMS is therefore determined by the terminal SOC of the battery. As shown in Table 7, the terminal SOC values for the DP-based, DDPG-based, DQN-based, and rule-based EMSs are 0.3465, 0.3452, 0.3124, and 0.3146, respectively, over four UDDS driving cycles. These terminal SOC values are averaged across multiple driving cycles to ensure robustness. Compared with the terminal SOC of the DP-based EMS, the differences for the DDPG-based, DQN-based, and rule-based EMSs are 0.0013, 0.0341, and 0.0319, respectively. This indicates that, under the same state-action space, reward function, and training hyperparameters, the gap between the DDPG-based EMS and the DP-based EMS is reduced to 0.37%, and the economy is improved by 10.49% compared to the DQN-based EMS.
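For clarity, the two percentages follow directly from the Table 7 terminal SOC values, assuming the gap is expressed relative to the DP-based terminal SOC and the improvement relative to the DQN-based terminal SOC:

\[
\frac{0.3465 - 0.3452}{0.3465} \approx 0.37\%, \qquad \frac{0.3452 - 0.3124}{0.3124} \approx 10.5\%.
\]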
Additionally, Figure 9c,d show that the differences among the EMSs lead to significant disparities in the energy allocation of the electric vehicle. The DRL-controlled EMSs effectively limit the maximum current over the driving cycle by exploiting the ultracapacitor’s strength in high-power applications, and the terminal SOC under the DQN-based EMS reflects improved utilization of the ultracapacitor. Table 7 lists the maximum charging currents of the battery and ultracapacitor under the various EMSs; the DDPG-based EMS effectively reduces the impact of peak currents on the battery and extends the battery’s lifespan.
In addition, the variation curves of the battery power and ultracapacitor power are shown in Figure 9e,f. The majority of the regenerative energy is absorbed by the ultracapacitor, and the DDPG-based EMS further reduces the fluctuation amplitude of the battery output power. It can therefore effectively maintain the stability of the lithium-ion battery’s output and decrease the driving costs of electric vehicles. Moreover, Figure 10 shows the energy losses under the different EMSs, with the specific values listed in Table 8. Under the same benchmark, the DDPG-based EMS narrows the total energy loss gap with the DP-based EMS to 0.7%, and compared to the DQN-based EMS it reduces energy losses by 40.4%, indicating the higher economic efficiency of the DDPG-based EMS.
5. Conclusions
In this paper, two DRL-based EMSs are designed for electric vehicles. To investigate the impact of different DRL algorithms on EMSs for HESSs, the same benchmark is used to compare and analyze the performance of each EMS. The simulation results demonstrate that the DDPG-based EMS allocates the output power of the various components in the HESS more effectively. Compared with the rule-based EMS, the DQN-based and DDPG-based EMSs improve the economic efficiency by 28.3% and 33.6%, respectively. Furthermore, the energy loss gap between the DDPG-based EMS and the DP-based EMS is reduced to 0.7%. The DQN-based EMS maximizes ultracapacitor efficiency in recovering regenerative energy under varying driving conditions, while the DDPG-based EMS restrains the peak current of the lithium-ion battery, demonstrating its adaptability.
In future research, additional enhancements will be made to overcome some limitations of the current work. For example, the agents in DRL algorithms are highly sensitive to hyperparameter settings, which can reduce the efficiency of data interaction. Additionally, the influence of temperature and aging on the EMS is not considered in the proposed method. Therefore, in future work, factors such as aging status, temperature status, and traffic conditions will be incorporated into the DRL-based EMSs to improve the management performance of the HESS.