**3. The DDPG Algorithm**

The deep deterministic policy gradient (DDPG) [29,30] is an improved algorithm based on the DQN algorithm that can handle multidimensional continuous action outputs. This optimization method operates on a continuous action space and does not require a specific optimization model, so it supports black-box learning. It focuses on only three concepts [20]: state, action, and reward, and its goal is to obtain the largest cumulative reward.

The selection of the DDPG algorithm mainly considers the following points.


In the real world, there is an interaction process between the Agent and its surrounding dynamic environment [32], which can be explained as follows: after the Agent generates an action under a certain state, the environment gives the Agent a corresponding reward, and the Agent then enters the next state and generates the next action. Reinforcement learning is a machine learning paradigm whose modeling goal is to construct an Agent in the environment such that the Agent always generates actions that maximize the reward. Following the definitions in reinforcement learning, the state of the Agent at time *t* is $s_t$, the action under state $s_t$ is $a_t$, the feedback from the environment is $r_t$, and the next state the Agent enters is $s_{t+1}$. Corresponding to the content of the current paper, at time *t*, the vector $(w_t, \delta_t)$ composed of the wheel speed ($w_t$) and deflection angle ($\delta_t$) of each wheel is regarded as $s_t$. The drive torque distribution ratio of each wheel ($p_t$) is regarded as $a_t$, the vehicle *SOC* ($u_t$) is regarded as $r_t$, and the vector $(w_{t+1}, \delta_{t+1})$ stands for $s_{t+1}$.
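As an illustration of this mapping, the following Python snippet shows one way the state, action, and reward described above could be assembled; the array layout, names, and the choice of the *SOC* itself as the reward signal are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def make_state(wheel_speeds, deflection_angles):
    """s_t: wheel speeds w_t and deflection angles delta_t of the 8 wheels."""
    return np.concatenate([wheel_speeds, deflection_angles])   # shape (16,)

# a_t: drive torque distribution ratio of each wheel, e.g. an even split
a_t = np.full(8, 1.0 / 8.0)

def reward(soc_t):
    """r_t: a scalar derived from the battery SOC at time t, so that maximizing
    the cumulative reward corresponds to preserving the SOC."""
    return soc_t
```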

In reinforcement learning, the commonly used optimization objective $R_t$ is the expectation of the total future reward at time *t*, which here corresponds to the expectation of the battery *SOC* in the future, as follows.

$$R_t = r_t + \gamma \cdot r_{t+1} + \gamma^2 \cdot r_{t+2} + \dots = \sum_{i=0}^{+\infty} \gamma^i \cdot r_{t+i} \tag{5}$$

where γ is a coefficient, 0 < γ < 1, which ensures that $R_t$ converges. In order to solve for $R_t$, the above formula can be rewritten as an iterative formula.

$$R_t = r_t + \gamma \cdot R_{t+1} \tag{6}$$
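A small numerical sketch of Equations (5) and (6): the discounted return can be accumulated backwards with the recursion $R_t = r_t + \gamma R_{t+1}$. The reward sequence below is purely illustrative.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t for every step by applying Equation (6) from the last step backwards."""
    returns = [0.0] * len(rewards)
    future = 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future   # Equation (6)
        returns[t] = future
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.9))   # [2.71, 1.9, 1.0]
```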

In the study of Q-learning, if there is a function $Q^* : \mathrm{State} \times \mathrm{Action} \to \mathbb{R}$ that represents $R_t$, then the optimal action strategy function $A^*$ can be obtained.

$$A^*(s_t) = \underset{a_t}{\operatorname{argmax}}\, Q^*(s_t, a_t), \tag{7}$$
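For a discrete action set, Equation (7) reduces to a simple argmax over the Q values; the toy Q-values below are only an illustration (the action space in the current paper is multidimensional and continuous, which is exactly why DDPG is adopted later).

```python
import numpy as np

def greedy_action(q_values):
    """Equation (7): pick the action index that maximizes Q*(s_t, a_t)."""
    return int(np.argmax(q_values))

print(greedy_action(np.array([0.1, 0.7, 0.3])))   # -> 1
```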

Usually, as the environment is poorly understood, $Q^*$ cannot be accessed directly, but the deep neural network has been proved to be a universal function approximator, so it can be used to approximate $Q^*$. In the current paper, the deep neural network is expressed as $Q(s_t, a_t; \Theta)$, where $\Theta$ represents the parameters to be solved; in fact, a deep fully connected neural network is used. Therefore, when $Q$ approaches $Q^*$, $\Theta$ is the optimal parameter $\Theta^*$, and the following equation can be obtained.

$$Q(s_t, a_t; \Theta^*) = r_t + \gamma\, Q(s_{t+1}, a_{t+1}; \Theta^*), \tag{8}$$

Given the definition of the optimal action strategy function $A^*$,

$$A^*(s_t) = \underset{a_t}{\operatorname{argmax}}\, Q(s_t, a_t; \Theta^*), \tag{9}$$

Equation (8) can therefore be expressed as follows.

$$Q(s_t, a_t; \Theta^*) = r_t + \gamma\, Q(s_{t+1}, A^*(s_{t+1}); \Theta^*), \tag{10}$$

Therefore, the optimization objective of the deep neural network can be defined as follows.

$$L(\Theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim P}\left[\left(r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \Theta) - Q(s_t, a_t; \Theta)\right)^2\right], \quad \underset{\Theta}{\operatorname{argmin}}\, L(\Theta), \tag{11}$$

where $L(\Theta)$ denotes the optimization objective function with $\Theta$ as the independent variable, $\mathbb{E}$ is the expectation, and $P$ represents a probability distribution. The above equation is the optimization objective of the DQN algorithm, but it is only applicable when $a_t$ is discrete. In the current paper, $a_t$ lies in a multidimensional continuous space. Therefore, DDPG, an improved algorithm derived from DQN, uses a deep neural network $A(s_t; \Phi)$ to approximate the optimal action strategy function $A^*$, so the optimization objective is as follows.

$$L_1(\Theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim P}\left[\left(r_t + \gamma\, Q(s_{t+1}, A(s_{t+1}; \Phi); \Theta) - Q(s_t, a_t; \Theta)\right)^2\right]$$

$$L_2(\Phi) = \mathbb{E}_{s_t \sim P}\left[Q(s_t, A(s_t; \Phi); \Theta)\right] \tag{12}$$

$$\underset{\Theta}{\operatorname{argmin}}\, L_1(\Theta), \qquad \underset{\Phi}{\operatorname{argmax}}\, L_2(\Phi)$$

where $L_1(\Theta)$ represents the DQN-style optimization target, and $L_2(\Phi)$ denotes the optimization target for approximating the action strategy function $A^*$. In order to make the optimization process more stable, $\Phi$ and $\Theta$ in the target term of Equation (12) are replaced with $\Phi_s$ and $\Theta_s$, the corresponding soft-update parameters.

$$\begin{aligned} \Phi_s &= \tau \Phi + (1 - \tau)\, \Phi_s \\ \Theta_s &= \tau \Theta + (1 - \tau)\, \Theta_s \end{aligned} \tag{13}$$

where τ is a coefficient, 0 < τ < 1. The expectations in $L_1(\Theta)$ and $L_2(\Phi)$ can be estimated approximately by Monte Carlo sampling, so the optimization objectives are rewritten as follows.

$$L_1(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left(r_t^{(i)} + \gamma\, Q(s_{t+1}^{(i)}, A(s_{t+1}^{(i)}; \Phi_s); \Theta_s) - Q(s_t^{(i)}, a_t^{(i)}; \Theta)\right)^2$$

$$L_2(\Phi) = \frac{1}{N} \sum_{i=1}^{N} Q(s_t^{(i)}, A(s_t^{(i)}; \Phi); \Theta) \tag{14}$$

$$\underset{\Theta}{\operatorname{argmin}}\, L_1(\Theta), \qquad \underset{\Phi}{\operatorname{argmax}}\, L_2(\Phi)$$

where N is the dimension number, N = 8, and the superscript (*i*) denotes the corresponding wheel number. In fact, the stochastic gradient descent algorithm is used to optimize the two targets alternately, and the parameter update method is as follows.

$$\begin{cases} \Theta^{(t)} = \Theta^{(t-1)} - \alpha_{\Theta} \nabla L_1\left(\Theta^{(t-1)}\right) \\ \Phi^{(t)} = \Phi^{(t-1)} + \alpha_{\Phi} \nabla L_2\left(\Phi^{(t-1)}\right) \end{cases} \tag{15}$$
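The alternating update of Equations (12)–(15) can be sketched in PyTorch as follows; `critic` corresponds to $Q(s, a; \Theta)$, `actor` to $A(s; \Phi)$, and `critic_target`/`actor_target` hold the soft-update parameters $\Theta_s$ and $\Phi_s$. The optimizer type, learning rates, γ, and τ are illustrative assumptions rather than the paper's settings.

```python
import torch

def ddpg_train_step(critic, actor, critic_target, actor_target,
                    critic_opt, actor_opt, batch, gamma=0.99, tau=0.005):
    s_t, a_t, r_t, s_next = batch   # tensors sampled from the interaction data

    # L1(Theta), Equation (14): squared TD error against the soft-update target
    with torch.no_grad():
        y = r_t + gamma * critic_target(s_next, actor_target(s_next))
    l1 = ((y - critic(s_t, a_t)) ** 2).mean()
    critic_opt.zero_grad()
    l1.backward()
    critic_opt.step()               # Theta <- Theta - alpha_Theta * grad L1

    # L2(Phi), Equation (14): raise the critic's value of the actor's own actions,
    # implemented as descent on -Q (second line of Equation (15))
    l2 = -critic(s_t, actor(s_t)).mean()
    actor_opt.zero_grad()
    l2.backward()
    actor_opt.step()                # Phi <- Phi + alpha_Phi * grad L2

    # Soft update of the target parameters, Equation (13)
    with torch.no_grad():
        for net, tgt in ((critic, critic_target), (actor, actor_target)):
            for p, p_s in zip(net.parameters(), tgt.parameters()):
                p_s.copy_(tau * p + (1.0 - tau) * p_s)
```

In such a sketch, `critic_opt` and `actor_opt` would typically be `torch.optim.Adam` (or SGD) instances constructed over the respective networks' parameters.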

When the optimization objective is reached, the parameters $\Theta^*$ and $\Phi^*$ are obtained, corresponding to the deep neural networks $Q(s_t, a_t; \Theta^*)$ and $A(s_t; \Phi^*)$. The function $A(s_t; \Phi^*)$ outputs a set of drive torque distribution ratios when the wheel speeds and deflection angles are input in real time. This distribution maximizes the expectation of the *SOC* in the future.

The network that generates $a_t$ is called the Actor network, so there are two networks in the algorithm, namely the $R_t$-Q network and the Actor network. The Actor network is responsible for generating the action, which is the torque distribution ratio of each wheel. The $R_t$-Q network is also commonly referred to as the Critic network, which is used to fit the sum of the system *SOC* over the next *n* steps, so that the Actor network has a clear optimization target. When the overall algorithm is executed, according to the training logic, $\Theta$ in the Q network is updated first and then passed as a parameter to the Actor network to update $\Phi$, with the aim of minimizing $-Q$. The actual training process trains $\Theta$ and $\Phi$ in the two networks alternately, and this process is called joint alternating training.

The overall implementation architecture is shown in Figure 5. The DDPG algorithm is embedded directly into the vehicle dynamics model through a MATLAB Function to ensure real-time interaction. During the training process, the vehicle system outputs the state and reward in real time. A 16-dimensional state signal, consisting of eight-dimensional wheel speed and eight-dimensional wheel deflection angle signals, is input to the Actor network, and an eight-dimensional wheel torque distribution ratio signal is output. For the Critic network, the same 16-dimensional state signal and the eight-dimensional action signal output by the Actor network are taken as inputs to fit the sum of the energy consumption over the next *n* steps. In addition, the Train function, which contains the logic of the algorithm training process, is implemented so that the Actor network and Critic network can be updated alternately according to the algorithm and produce the corresponding outputs.
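A minimal PyTorch sketch of fully connected Actor and Critic networks with the input/output dimensions described above (16-dimensional state, eight-dimensional action); the hidden-layer sizes and the output activation are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the 16-dimensional state to 8 torque distribution ratios."""
    def __init__(self, state_dim=16, action_dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Sigmoid(),   # ratios bounded to [0, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Scores a (state, action) pair, i.e. fits the future SOC-based return."""
    def __init__(self, state_dim=16, action_dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```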

In order to avoid possible data interaction problems between the two networks and the Train function caused by update synchronization in the model, all of them are written in one MATLAB Function module and called directly inside it. At the same time, taking into account the actual passing ability of the vehicle and preventing individual motors from outputting high torque for long periods, which would reduce their service life, an additional limitation is imposed: single-axle drive is not allowed in straight-line driving, with the 1st and 3rd axles acting as the main power distribution axles.

In addition, it needs to be clarified that the difference between the application scenario of the current paper and that of a traditional neural network algorithm is that the current action directly affects the environment at the next moment. If the environment could not be changed, only a single step of the overall process would actually be optimized.

**Figure 5.** Training process architecture of deep deterministic policy gradient (DDPG) algorithm.

## **4. Offline Simulation Verification**

After the relevant algorithm code is completed and can interact with the vehicle model, the model first needs to be trained a certain number of times. The purpose is to make the Actor and Critic networks update their internal parameters according to the training logic of the Train function and adapt to the whole system.

At present, there is no standard cycle condition for the evaluation of vehicle steering energy consumption, so the training condition of the model has to be designed artificially. Different training conditions will affect the final optimization results of the model. The designed training condition should contain enough state samples of the optimized system. At the same time, it should be avoided that, due to the influence of the training environment, experience with certain type characteristics is particularly abundant while experience of other types is scarce; ideally, experiences should be diverse and similar experiences should be minimized. Some unexpected changes are not considered during neural network training in the current paper, because they are difficult to cover completely. However, in order to avoid related problems, the average distribution, as a conservative control scheme, was combined with the neural network. By comparing the rewards at any time, the control scheme with the higher reward is adopted, which ensures that the energy consumption of the vehicle is never higher than that of the conventional driving mode under any working condition; this is a supportability control strategy.
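The supportability strategy can be sketched as follows: the learned distribution and the conservative average distribution are both evaluated, and the scheme with the higher predicted reward is applied. The `estimate_reward` function and variable names are illustrative placeholders, not the paper's implementation.

```python
import numpy as np

def select_distribution(state, actor_ratios, estimate_reward):
    """Fall back to the average distribution whenever it promises a higher reward."""
    average_ratios = np.full(8, 1.0 / 8.0)            # conventional even split
    if estimate_reward(state, actor_ratios) >= estimate_reward(state, average_ratios):
        return actor_ratios                           # learned policy is no worse
    return average_ratios                             # conservative fallback
```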

The state variables in the algorithm are the wheel speeds and the wheel deflection angles. Therefore, based on the above principles, the model inputs of target vehicle speed and steering wheel angle are shown in Figure 6. During training, only the first and second axles were steering axles. Meanwhile, considering the stability of the vehicle at high speed, the amplitude of the steering angle was decreased after 40 s.

According to the training conditions, after completing about 100, 200, ..., 500 training runs, the data and driving state curves were recorded. Figure 7a shows the change of vehicle speed after different numbers of training runs. The change of vehicle speed was little affected by the drive torque distribution, and the target vehicle speed could be followed well. Since the optimal torque distribution is equivalent to applying an additional yaw moment to the vehicle, the yaw rate of the vehicle increased in each period after redistribution, which can be seen in Figure 7b and is in line with the actual situation. Figure 7c compares the *SOC* change after the corresponding numbers of training runs. It can be seen that the *SOC* decline decreased as the number of training runs increased. After 500 training runs, the *SOC* decline under this training condition was reduced by about 4.5320%.

**Figure 6.** Model input of the training condition: (**a**) target vehicle speed; (**b**) steering wheel angle.

**Figure 7.** Changes of vehicle driving parameters and battery state of charge (*SOC*) after training: (**a**). Changes of vehicle speed with training times; (**b**). Changes of yaw rate with training times; (**c**). Changes of *SOC* with training times.

After completing the training, only the parameter matrix of the Actor network is retained and stored in the MATLAB Function, which receives the driving state of the vehicle in real time and generates the optimal distribution action. In theory, the more training runs, the more stable and closer to optimal the parameters of the Actor network become, and the better the optimization effect. However, as the number of training runs increases, the rate of return on optimization decreases. Meanwhile, in order to ensure the optimization effect, a fixed simulation step size of 1 ms was adopted in Simulink, while the action was updated every 10 steps by the control algorithm, which led to a significant increase in the computational burden of the model. Comparing the simulation results after 400 and 500 training runs, the optimization effect was found to be almost the same. Therefore, considering the optimization efficiency, the model training was finally completed after 500 runs.
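The deployment described above can be sketched as follows: only the trained Actor is evaluated online, the simulation advances with a 1 ms step, and the action is refreshed every 10 steps. The `actor_forward`, `get_state`, and `vehicle_step` callables are illustrative placeholders for the Simulink-side interfaces.

```python
import numpy as np

def run_trained_actor(actor_forward, get_state, vehicle_step,
                      sim_time=40.0, dt=0.001, hold_steps=10):
    action = np.full(8, 1.0 / 8.0)                 # start from the even distribution
    for k in range(int(sim_time / dt)):
        if k % hold_steps == 0:                    # control update every 10 ms
            action = actor_forward(get_state())    # 16-dim state -> 8 torque ratios
        vehicle_step(action, dt)                   # advance the vehicle model by dt
```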

### *4.1. Conventional Low-Speed Step Steering Condition*

The low-speed simulation condition was designed to accelerate the vehicle from the stationary state to a target speed of 30 km/h. At 20 s, the steering wheel was turned about 230° within 1 s, and only the first and second axles were steering axles. Figure 8a shows the actual change of vehicle speed. It can be seen that after the steering angle changed, the vehicle speed decreased slightly, which was caused by the increase of driving resistance and is consistent with the actual situation. Figure 8b is a detail view of the vehicle speed. Compared with the average distribution, the steady-state vehicle speed increased slightly after the optimal distribution of drive torque, but the difference was not significant. Because the redistribution of drive torque reduced the additional steering resistance, the drive torque required to maintain the steady state was reduced. It can be seen from Figure 2 that, with the target vehicle speed unchanged, the actual vehicle speed increased.

**Figure 8.** Changes of vehicle speed during low-speed simulation: (**a**). Changes of vehicle speed; (**b**). Partial enlarged drawing.

Figure 9 shows the vehicle yaw rate change and the vehicle track comparison, respectively. After the optimization control, the yaw rate of the vehicle increased by around 1.02%, and the radius of the track was also slightly reduced. From Figures 8 and 9, it can be seen that the optimal torque distribution promoted the steering trend, but its influence on the various driving state parameters of the vehicle was not significant and did not cause stability problems.

**Figure 9.** Vehicle yaw rate change and the vehicle track comparison: (**a**). Changes of yaw rate; (**b**). Comparison of driving trajectory.

It can be seen from Figure 10a that, after adopting the torque optimization control, the *SOC* decline was significantly reduced and the energy consumption was reduced by about 3.7856% between 0 s and 40 s. However, this interval includes the linear acceleration phase; although the torque was also optimally distributed during straight-line driving, the motors basically worked on the external characteristic curve during acceleration, and no training was carried out for the straight-line driving condition, so the optimization effect there was not obvious. Considering only the steering phase between 20 s and 40 s, the vehicle energy consumption was reduced by about 5.112% after optimization.

**Figure 10.** Changes of vehicle *SOC* and wheel drive torque after optimization: (**a**). Changes of vehicle *SOC*; (**b**). Changes of wheel drive torque.

Figure 10b shows the change of the drive torque of each wheel. In the linear acceleration phase, the drive torque of the whole vehicle was mainly distributed to the 1st and 3rd axles, similar to a two-axle drive, which increased the working load of some drive motors and improved the overall working efficiency. When steering, the drive torque of the outboard wheels increased and the drive torque of the inboard wheels decreased. Moreover, the drive torque of the outboard rear-axle wheels was relatively larger, because in the same cases a change of the rear-axle drive torque has a greater influence on the additional yaw moment of the whole vehicle, which is more conducive to reducing the energy consumption. In addition, the multi-axle vehicle body is longer, so this effect is relatively more obvious. When the vehicle was in steady-state steering, the total driving torque of the whole vehicle was about 3107 Nm with the average distribution, while it was about 2975.4 Nm with the optimized distribution, a relative reduction of about 4.2356%. Another part of the reduction in energy consumption comes from the improvement of motor working efficiency.
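As a quick check of the quoted figure, the relative reduction follows directly from the two steady-state torque values:

$$\frac{3107 - 2975.4}{3107} \times 100\% \approx 4.2356\%$$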

Figure 11 shows the comparison of the working point changes on the motor efficiency map. The wheel speed and output torque during steady-state steering were derived, and the actual working points of each in-wheel motor were calculated based on the deceleration ratio. As the relative speed difference between the left and right wheels was very small and could be approximately ignored, a single point was used to represent the actual working point of each motor when the drive torque was evenly distributed. After the optimal torque distribution control was adopted, the actual working point of each motor changed. The drive torque of the outboard wheels increased and their working efficiency improved. Although the working efficiency of the inboard wheels decreased, their drive torque was small, so the overall working efficiency was improved.

**Figure 11.** Comparison of motor working points.

### *4.2. Conventional High-Speed Sinusoidal Steering Condition*

The high-speed simulation condition was designed to accelerate the vehicle from the stationary state to a target speed of 70 km/h. At 20 s, the steering wheel input was a sine wave with an amplitude of 110°, as shown in Figure 12a. Similarly, the 1st and 2nd axles were steering axles. Figure 12b,c show the changes of the vehicle speed and the yaw rate. Similar to the step steering condition, the change of driving state was not obvious and the peak of the yaw rate increased slightly. Figure 12d shows the change of drive torque. Because the steering wheel input was constantly changing, the curvature radius of the vehicle path was also changing; it can be seen from Equation (3) that the additional steering resistance fluctuated accordingly. Therefore, when the driving torque was evenly distributed, the driving torque of each wheel also changed correspondingly. After the optimized distribution, more drive torque was distributed to the outboard wheels and the rear axles, which promoted the steering of the vehicle. Under the dynamic steering condition, the driving torque of each wheel could follow the changes of the system input, which indicates that the optimal control algorithm could adapt to a dynamic environment.

The changes of *SOC* can be seen in Figure 13a. After the optimization control, the *SOC* decline was reduced by 2.6213% between 0 s and 40 s. Comparing only the *SOC* change during the steering phase, the energy consumption of the vehicle decreased by 4.0482% after optimization, as shown in Figure 13b. This proves that the optimal torque distribution control based on energy consumption could reasonably distribute the drive torque of each wheel and reduce the energy consumption under dynamic conditions. In other words, the adopted optimization algorithm is not limited to specific working conditions; it can handle any steering condition, whether static or dynamic, optimizing the distribution of driving torque in real time and reducing the vehicle energy consumption. However, the optimization effect was slightly worse than that of the low-speed test, mainly for two reasons. On the one hand, the sine wave input is a dynamic process all the time, and the inevitable inertia of the mechanical system may have caused the actual action and the control signals not to be completely synchronized; although this effect is relatively small for an electric vehicle with in-wheel motors, it could not guarantee that the drive torque of each wheel was optimal at every moment. On the other hand, when the motors worked at high speed, the high-efficiency area on the efficiency map was relatively large, so the optimization effect after the control was slightly lower.

**Figure 12.** Changes in vehicle driving parameters during high-speed simulation condition: (**a**). Steering wheel angle; (**b**). Vehicle speed; (**c**). Yaw rate; (**d**). Drive torque of each wheel.

**Figure 13.** Changes of *SOC* after the optimization control: (**a**). Changes of *SOC* in the whole process; (**b**). Changes of *SOC* in the steering phase.
