Article

Research on the Deep Deterministic Policy Algorithm Based on the First-Order Inverted Pendulum

Hailin Hu, Yuhui Chen, Tao Wang, Fu Feng and Weijin Chen
1 School of Electrical Engineering and Automation, Jiangxi University of Science and Technology, Ganzhou 341000, China
2 Jiangxi Provincial Key Laboratory of Maglev Technology, Ganzhou 341000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(13), 7594; https://doi.org/10.3390/app13137594
Submission received: 24 May 2023 / Revised: 16 June 2023 / Accepted: 26 June 2023 / Published: 27 June 2023

Abstract

With the maturing of artificial intelligence technology, applying intelligent control algorithms in control systems has become a trend for meeting the high-performance requirements of modern society. This paper proposes a deep deterministic policy gradient (DDPG) controller design method based on deep reinforcement learning to improve system control performance. First, the optimal control policy of the DDPG algorithm is derived from the Markov decision process and the Actor–Critic algorithm. Second, to avoid the local optima of traditional control systems, the capacity and settlement method of the DDPG experience pool are adjusted so that positive experience is absorbed, convergence is accelerated, and training is completed efficiently. In addition, to address the overestimation of the Q value in DDPG, the overall structure of the Critic network is changed, which shortens the convergence period of DDPG at low learning rates. Finally, a first-order inverted pendulum control system is constructed in a simulation environment to verify the control effectiveness of PID, DDPG, and the improved DDPG. The simulation results show that the improved DDPG controller responds faster to disturbances and yields smaller displacement and angular displacement of the first-order inverted pendulum. The simulation further demonstrates that the improved DDPG algorithm has better stability and convergence, stronger anti-interference ability, and better stability recovery. This control method provides a reference for applying reinforcement learning in traditional control systems.

1. Introduction

Modern society places ever higher requirements on the dynamic performance of control systems. Traditional control methods such as PID [1] and fuzzy control [2] cannot achieve satisfactory results in terms of anti-interference ability, dynamic performance, and nonlinear adjustment accuracy, and they converge slowly in continuous action spaces [3]. Furthermore, traditional control methods require precise mathematical modeling before a control strategy can be formulated. In practical control systems, and especially in the first-order inverted pendulum, the many influencing factors make the actual system largely nonlinear and strongly coupled, so it is very difficult to establish an accurate model of the system's motion. The model used in traditional control methods is a linearized mathematical model near the system's equilibrium point, which leads to poor robustness of the control algorithm. When the system state is far from the equilibrium point, the effectiveness of the control algorithm decreases greatly or is even lost.
In light of the above, deep reinforcement learning is introduced into nonlinear system control. Deep reinforcement learning is one of the major branches of artificial intelligence and has made significant progress in recent years. DDPG was proposed by Lillicrap et al. in 2015; it uses a replay buffer and target networks to formulate the target policy as a deep deterministic policy gradient, and it combines the deep Q-network (DQN) [4] with the deterministic policy gradient (DPG) [5] to address the convergence problems of neural networks and slow algorithm updates in continuous action spaces.
At the algorithm level, to improve the training efficiency of DDPG, avoid local optima, and accelerate convergence, Li et al. proposed using an experience pool in off-policy reinforcement learning to store experience samples and combining suitable sampling strategies for sampling the data in the buffer pool, which can accelerate training [6]. Schaul et al. proposed a prioritized experience replay mechanism for the buffer pool and applied it in DQN. This method increases the probability that important experiences are replayed, making learning more efficient, and it achieved good results in some experiments [7]. Zhang et al. proposed a deep deterministic policy gradient with episode experience replay (EEP-DDPG), which stores sample data in episode units and classifies them through two buffer pools. During the training phase, it preferentially samples data with higher cumulative returns to improve training efficiency [8]. The above improvements to the buffer pool focus on using more experience data with higher return values to improve the efficiency of experience replay. This paper instead starts from the environmental state of each experience sample, designs a recursive small experience pool structure, and optimizes the settlement method.
In order to find the optimal Q value, Doltsinis et al. and Wang et al. proposed the Q-learning algorithm and the SARSA algorithm, respectively [9]. However, when the state and action dimensions increase, the size of the Q table grows geometrically, resulting in the 'curse of dimensionality' [10]. In 2015, Mnih et al. used deep neural networks to fit the Q table and proposed the DQN algorithm. However, using a greedy policy to select the Q value leads to overestimation of the Q value and, in turn, significant deviation [11]. Therefore, Hasselt et al. proposed the Double Deep Q-Network (DDQN) to decouple action selection from Q-value calculation: the main network selects the action with the highest Q value, and this action is used to calculate the Q value of the target network, which mitigates the overestimation of the Q value [12]. To improve the performance of the DQN algorithm, Wang et al. proposed the Dueling DQN algorithm. Unlike DQN, this algorithm splits the Q network into a value function and an advantage function, and the final output is a linear combination of the two. This prevents excessive Q values and improves the expressive ability of the state–action value function [13].
In the field of motion control, Zhang et al. used the DQN in 2015 to train a robot controller directly from raw pixel images without prior knowledge [14]. In 2017, Gu et al. used the DQN algorithm without prior knowledge to train robots' spatial manipulation ability and complete complex limb movements, exemplified by opening doors [15]. In 2017, Sallab et al. used the DDPG algorithm to test the automated driving of 3D cars based on TORCS in a simulator. The results showed that the autonomous vehicles could learn complex road conditions and interact with other vehicles better than non-autonomous vehicles [16]. In 2020, Liu et al. conducted balance control experiments on a bipedal robot using the DDPG algorithm. After multiple rounds of training, the posture angle of the bipedal robot was kept within [−4°, 4°], and its stability was significantly improved [17]. In 2022, Liu et al. applied DDPG to robots to improve target tracking and obstacle-avoidance control; the results show that DDPG performs better than PID [18]. In 2021, Xue et al. proposed a fractional gradient-descent RBF neural network applied in DDPG to control the inverted pendulum. Depending on the fractional order, the RBF network adopts differentiation or integration to accelerate the convergence of the gradient descent method and improve control performance [20]. In 2022, Wang et al. adopted the Swish function and a baseline function as the activation function to improve the training efficiency of DDPG, which was verified on an inverted pendulum system [19].
The DDPG algorithm has good knowledge transfer ability and can learn the mathematical model of a black-box or gray-box system from its inputs and outputs. Based on the reward function, observation function, and constraint conditions, it can achieve optimal control of the control system [21]. However, DDPG updates the Q value in the same way as DQN, which can easily lead to overestimation of the Q value for certain control actions, resulting in increased bias and suboptimal strategies. In addition, DDPG is very sensitive to hyperparameter settings, which directly affect its convergence stability.
With the development and maturity of artificial intelligence technology, applying intelligent control algorithms in control systems has become a trend. This paper proposes and improves a deep deterministic policy gradient algorithm for control. Firstly, based on the principles of reinforcement learning, the DDPG algorithm is developed to give the system strong self-learning and self-tuning capabilities. Then, by optimizing the experience pool and the Critic network structure, the problems of local optima, long convergence periods, and Q-value overestimation in DDPG at low learning rates are addressed. By learning to update the control policy, the dynamic performance, anti-interference ability, and response speed under load disturbances are improved. Finally, the better dynamic performance of the improved DDPG controller is verified on the classic first-order inverted pendulum through comparison with the classic PID and DDPG algorithms.

2. Construction of a Control System Based on DDPG

Reinforcement learning enables interactive learning between an agent and its environment to complete specific tasks and maximize reward values [22]. The essence of reinforcement learning can be formalized as a Markov decision process, and the value-based and policy-based optimization strategies derived from it are combined and complementary, forming the Actor–Critic algorithm. Combined with the experience replay and target network construction of the DQN algorithm, this forms the basic framework of the DDPG algorithm. For control systems, the key to introducing the DDPG algorithm is to construct a suitable neural network structure based on the feedback the system receives from the environment and to determine the learning rates of the Critic and Actor networks based on the output dimension of the neural network. At the same time, according to the order of magnitude and the number of feedback quantities, a corresponding form of reward function is adopted. Finally, the training hyperparameters are set to keep the training process reasonably convergent and to avoid local optima and suboptimal solutions. The overall control flowchart of DDPG is shown in Figure 1.

2.1. Markov Decision Process

The decision-making process of the neural network constructed by DDPG can be represented as a Markov decision process (MDP), a theoretical framework for achieving an ultimate goal through interactive learning. It can be described by the quadruple (S, A, R, P), as shown in Figure 2, which illustrates the interaction between the agent and the environment during decision making.
During the interaction, the agent generates an action At based on the current state St, then obtains the next state St+1 from the environment along with the reward Rt+1 for the action taken. The ultimate goal of the agent is to maximize the cumulative reward obtained during long-term operation, which is defined in Equation (1).
G_t = R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots \right) = R_{t+1} + \gamma G_{t+1}    (1)
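To make Equation (1) concrete, the short Python sketch below evaluates the discounted return of a hypothetical reward sequence both by the explicit sum and by the recursive form, confirming that the two expressions agree; the rewards and discount factor are illustrative values, not taken from this paper.

```python
# Sketch of the discounted return in Equation (1); rewards and gamma are hypothetical.
rewards = [1.0, 0.5, 0.2, 0.0, -1.0]   # R_{t+1}, R_{t+2}, ..., truncated after 5 steps
gamma = 0.9

# Explicit sum: G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
g_sum = sum(gamma**k * r for k, r in enumerate(rewards))

# Recursive form: G_t = R_{t+1} + gamma * G_{t+1}, evaluated backwards from the last step
g_rec = 0.0
for r in reversed(rewards):
    g_rec = r + gamma * g_rec

print(g_sum, g_rec)   # both print the same value
```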
The expected return obtained by taking action a in state s under the policy π is called the action value function q_π(s, a) of the policy π. The relationship between the state value function and the action value function is given in Equation (2), and the corresponding form of the action value function is shown in Equation (3).
v_\pi(s) = \sum_{a \in A} \pi(a \mid s) \, q_\pi(s, a)    (2)
q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} p(s', r \mid s, a) \, v_\pi(s')    (3)
When solving problems through reinforcement learning, finding an optimal policy π that maximizes long-term benefit is the key. Equation (4) shows that the search for the optimal policy can be based on either the state value function or the action value function. Accordingly, reinforcement learning algorithms based on value and on policy have both emerged.
v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s, a) = \max_\pi q_\pi(s, a)    (4)

2.2. Actor–Critic Algorithm

In DDPG, the Critic network mainly evaluates the quality of actions based on state value, while the Actor network determines the next action based on the action policy. First, an optimization policy based on state value is introduced. Its core is a Q table over the states and actions of the environment, whose entries are the values Q(s, a) of the action value function when taking an action from the current state. An action is selected based on the table and then executed. Action selection follows the ε-greedy rule shown in Equation (5), where m is the total number of selectable actions. After execution, a reward is obtained and Q(s, a) is updated as shown in Equation (6), where α is the learning rate; the values in the Q table are updated accordingly.
\pi(a \mid s) = \begin{cases} \varepsilon/m + 1 - \varepsilon, & \text{if } a^* = \arg\max_{a \in A} Q(s, a) \\ \varepsilon/m, & \text{otherwise} \end{cases}    (5)
Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{a'} Q(S', a') - Q(S, A) \right]    (6)
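As an illustration, the sketch below implements the ε-greedy selection rule of Equation (5) and the tabular update of Equation (6) in Python; the state and action counts and the hyperparameter values are assumptions made for the example, not settings from this paper.

```python
import numpy as np

n_states, n_actions = 10, 4          # assumed sizes of a small discrete problem
Q = np.zeros((n_states, n_actions))  # the Q table
alpha, gamma, eps = 0.1, 0.99, 0.1   # learning rate, discount factor, exploration rate

def epsilon_greedy(state):
    """Equation (5): explore with probability eps, otherwise take the greedy action."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def q_learning_update(s, a, r, s_next):
    """Equation (6): Q(S,A) <- Q(S,A) + alpha * [R + gamma * max_a' Q(S',a') - Q(S,A)]."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```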
In practice, the number of state vectors and actions is often so large that the table becomes enormous and falls into the curse of dimensionality. Therefore, on the basis of Q-learning, a neural network is added to approximate the value function, forming the deep Q-network. At the same time, an experience replay mechanism [23] and a target network are added to improve the convergence of the network. The method contains a current network and a target network with the same structure. The current network selects actions and updates its parameters θ by minimizing the loss function shown in Equation (8). The target network calculates the Q target value y_i as shown in Equation (7), where θ_i^- denotes the target network parameters and ρ(·) is the probability distribution over states s and actions a.
y_i = \mathbb{E}_{s' \sim E} \left[ r + \gamma \max_{a'} Q(s', a'; \theta_i^-) \right]    (7)
L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)} \left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right]    (8)
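The target value of Equation (7) and the loss of Equation (8) might be computed roughly as in the following PyTorch sketch; the network sizes are placeholders, and the expectation is replaced by a mini-batch average, so this is an illustration of the mechanism rather than the authors' implementation.

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2   # assumed dimensions of the example problem
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())   # the target network starts as a copy

def dqn_loss(s, a, r, s_next, gamma=0.99):
    """s: [B, state_dim], a: [B] long, r: [B], s_next: [B, state_dim]."""
    with torch.no_grad():                                    # freeze the target, Equation (7)
        y = r + gamma * target_net(s_next).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; theta) for the taken actions
    return nn.functional.mse_loss(q, y)                      # Equation (8)
```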
Figure 3 shows the training process of the DQN algorithm.
From the above analysis, it can be seen that the DQN family of algorithms is suited to discrete action problems, and the output policy is deterministic. However, many practical problems involve continuous actions and call for stochastic rather than deterministic policies. Therefore, scholars proposed reinforcement learning algorithms based on action policies, i.e., the policy gradient algorithms used in Actor networks. By outputting the probability of each action occurring, one or more optimal actions are selected. Assume the agent explores a trajectory τ under a stochastic policy π(a|s, θ). The policy gradient can then be expressed as Equation (9), and after it is computed, the parameters are updated by gradient ascent, as shown in Equation (10).
\nabla_\theta J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta (a_{i,t} \mid s_{i,t}) \, R(\tau^{(i)}) \right]    (9)
\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)    (10)
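A minimal sketch of the gradient-ascent update of Equations (9) and (10) is given below in PyTorch; the negative of the sampled objective is minimized so that a standard optimizer performs the ascent step, and all dimensions and learning rates are assumptions for illustration.

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2   # assumed dimensions
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                       nn.Linear(64, n_actions), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def policy_gradient_update(states, actions, returns):
    """states: [T, state_dim], actions: [T] long, returns: [T] discounted returns R(tau)."""
    probs = policy(states)                                        # pi_theta(a | s)
    log_probs = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1))
    loss = -(log_probs * returns).mean()                          # minus the estimate in Equation (9)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                              # Equation (10): gradient ascent on J
```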
By inputting the state into the policy network and outputting the probability of each action, actions can be selected within a continuous interval. However, the parameters are only updated once per episode, which lowers learning efficiency. Value-based and policy-based algorithms each have their own advantages and disadvantages, so researchers combined the two to form the Actor–Critic algorithm. The Actor is a policy network responsible for interacting with the environment: it takes states as input and outputs the actions executed by the agent. The Critic is a value network: it takes states and actions as input and outputs an evaluation of the actions generated by the policy network. It assesses the quality of the policy network's current actions and thereby guides the policy network to produce better actions. The flowchart of the Actor–Critic algorithm is shown in Figure 4.
The Actor–Critic network is updated with the temporal-difference (TD) method [24]. According to the Bellman equation, the state values at two consecutive time steps follow a recursive relationship, so the TD error can be expressed as Equation (11).
\delta = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)    (11)
The critic network is updated by minimizing the TD error shown in Equation (11). The update method for the Actor network is derived from Equations (10) and (11) as Equation (12).
\theta = \theta + \alpha \nabla_\theta \log \pi_\theta (S_t, A) \, \delta    (12)
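One possible realization of this update loop is sketched below; for simplicity the critic estimates a state value rather than the action value written in Equation (11), which is a common simplification, and all network sizes and learning rates are assumed for illustration.

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2   # assumed dimensions
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                      nn.Linear(64, n_actions), nn.Softmax(dim=-1))
critic = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def actor_critic_step(s, a, r, s_next, gamma=0.99):
    """s: [B, state_dim], a: [B] long, r: [B, 1], s_next: [B, state_dim]."""
    with torch.no_grad():
        td_target = r + gamma * critic(s_next)        # bootstrapped target value
    delta = td_target - critic(s)                     # TD error, Equation (11)

    critic_loss = delta.pow(2).mean()                 # Critic: minimize the squared TD error
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    log_prob = torch.log(actor(s).gather(1, a.unsqueeze(1)).squeeze(1))
    actor_loss = -(log_prob * delta.detach().squeeze(1)).mean()   # Actor update, Equation (12)
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```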

3. Improved DDPG Based on Local Optimization and Q-Value Overestimation

3.1. Recursive Small Experience Pool DDPG Algorithm

In the Actor–Critic algorithm, the output actions of the Actor network depend on the guidance of the Critic network. However, because of the inherent problems of difficult convergence and a tendency to fall into local optima, combining the two networks further increases the difficulty of convergence [25]. Therefore, to solve these problems, some scholars have integrated the solutions used in the DQN algorithm, forming the deep deterministic policy gradient algorithm. The DDPG algorithm includes a Critic network updated by value, an Actor network updated by policy, and an experience pool for storing samples. The Critic network is divided into a current network Ce and a target network Ct, while the Actor network is divided into a current network Ae and a target network At. The samples in the experience pool are generated by Ae, which produces an action from the current state S, as shown in Equation (13):
a_t = \mu(s_t \mid \theta^\mu)    (13)
Here, μ represents the current policy and θ^μ represents the parameters of the current network Ae. After this action is executed, the reward r and the next state St+1 are generated, and this set of values is stored in the experience pool. The current network Ce calculates the actual Q value Q(s_i, a_i | θ^Q), and the target network Ct calculates the target Q value y_i as shown in Equation (14), where θ^{Q'} is the parameter of the Ct network and θ^{μ'} is the parameter of the At network. The Ce network and the Ae network are then updated by minimizing their loss functions: the loss function of the Ce network is defined in Equation (15), and that of the Ae network in Equation (16):
y_i = r_i + \gamma Q' \left( s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'} \right)    (14)
Loss = \frac{1}{N} \sum_i \left( y_i - Q(s_i, a_i \mid \theta^Q) \right)^2    (15)
\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^Q) \big|_{s = s_i,\, a = \mu(s_i)} \, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu) \big|_{s_i}    (16)
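Equations (13)–(16) together describe one DDPG update step. The condensed PyTorch sketch below illustrates the mechanism with placeholder network shapes and learning rates; it is not the exact implementation used in this paper.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 4, 1   # assumed dimensions

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

actor, actor_target = mlp(state_dim, action_dim), mlp(state_dim, action_dim)             # Ae, At
critic, critic_target = mlp(state_dim + action_dim, 1), mlp(state_dim + action_dim, 1)   # Ce, Ct
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=5e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next, gamma=0.995):
    """s: [B, state_dim], a: [B, action_dim], r: [B, 1], s_next: [B, state_dim]."""
    # Equation (14): target Q value from the target actor At and target critic Ct
    with torch.no_grad():
        a_next = actor_target(s_next)
        y = r + gamma * critic_target(torch.cat([s_next, a_next], dim=1))
    # Equation (15): critic loss of Ce
    q = critic(torch.cat([s, a], dim=1))
    critic_loss = nn.functional.mse_loss(q, y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Equation (16): update Ae by ascending Q(s, mu(s)), i.e., minimizing its negative
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```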
After θ^Q and θ^μ are updated, θ^{Q'} and θ^{μ'} are updated through soft updates. Unlike the DQN algorithm, which directly copies the parameters of the current network to the target network, this paper adds an update coefficient τ to the DDPG algorithm, determined by the size of the experience pool and the amount of effective experience. In short, the experience pool is divided into n small experience pools, an evaluation function is introduced in advance to assess each small pool, the effective experience is stored in a new small experience pool, and finally all of the effective experience is extracted for network updates. Although τ is small, it keeps each parameter update close to the previous one, effectively mitigates local optima and suboptimal solutions, reduces training time, and improves control performance. The updating method is shown in Equation (17):
\theta^{\mu'} \leftarrow \tau \theta^\mu + (1 - \tau) \theta^{\mu'}    (17)
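Since the paper does not give implementation details for the recursive small experience pool, the sketch below is only an illustrative interpretation: transitions are grouped into n small pools, each pool is scored by an assumed evaluation function (here, its mean reward), and only the better-scoring pools feed the effective pool used for sampling. The soft update of Equation (17) is also shown.

```python
import random
from collections import deque

class PartitionedReplayBuffer:
    """Illustrative sketch of the recursive small-experience-pool idea; the
    scoring rule (mean reward of a pool) is a placeholder assumption."""

    def __init__(self, n_pools=10, pool_size=1000, keep_ratio=0.5):
        self.pools = [deque(maxlen=pool_size) for _ in range(n_pools)]
        self.effective = deque(maxlen=n_pools * pool_size)   # pool of retained experience
        self.keep_ratio = keep_ratio
        self.idx = 0

    def add(self, transition):
        """transition = (s, a, r, s_next, done); fill the small pools in turn."""
        self.pools[self.idx].append(transition)
        if len(self.pools[self.idx]) == self.pools[self.idx].maxlen:
            self.idx = (self.idx + 1) % len(self.pools)

    def consolidate(self):
        """Score each small pool and move the best ones into the effective pool."""
        scored = [(sum(t[2] for t in p) / len(p), p) for p in self.pools if p]
        scored.sort(key=lambda item: item[0], reverse=True)
        for _, pool in scored[:max(1, int(len(scored) * self.keep_ratio))]:
            self.effective.extend(pool)

    def sample(self, batch_size):
        return random.sample(self.effective, batch_size)

def soft_update(target_net, source_net, tau=0.001):
    """Equation (17): theta' <- tau * theta + (1 - tau) * theta' (for torch modules)."""
    for tp, sp in zip(target_net.parameters(), source_net.parameters()):
        tp.data.copy_(tau * sp.data + (1.0 - tau) * tp.data)
```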

3.2. Optimization Design of Critic Network Structure

In value-based reinforcement learning methods, the approximation error of the value function leads to overestimated values and a suboptimal policy. The problem of overestimating the Q value of the Q function still exists in the DDPG algorithm. Moreover, the Critic network uses the temporal-difference algorithm to estimate the value function, which itself carries unavoidable bias, so overestimation cannot be avoided within the algorithm. During reinforcement learning training, accurately estimating the optimal Q value Q(s, a) is key to maximizing the cumulative reward and obtaining the optimal policy, so the accuracy of the Q-value estimate determines the quality of the entire learning process. Moreover, the overestimation problem accumulates error as training continues. This error causes some poor samples to receive higher Q values and be selected during training, which can form local optima and lead to suboptimal updates and divergent policy behavior, so that the training process does not converge.
Simply changing the experience pool size of the Critic network cannot fundamentally solve the overestimation of Q values caused by the greedy selection. In such cases, one or more reference networks can be set up to evaluate the Q value of the Critic network. One or more additional sets of networks, each containing a current network and a target network, are added to the Critic side. When the Actor network generates an action, these two or more networks are used for estimation, the minimum Q target value produced by the two target networks is taken, and thus fewer overestimated Q values are selected. These Critic networks are then updated as shown in Equation (18):
y_i = r + \gamma \min_{i = 1, 2} Q_{\theta_i'}(s', a')    (18)
After that, the loss function of the value network can be updated. For the update of the policy network, either value network can be used; because the two networks become more and more similar in the end, the choice does not affect the update of the policy network. At the same time, the improved algorithm adds normally distributed noise when the target network of the policy network generates actions, to avoid overfitting of the value function. The updating method is shown in Equation (19):
y_i = r + \gamma \min_{i = 1, 2} Q_{\theta_i'}(s', a' + \varepsilon), \qquad \varepsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c)    (19)
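The combination of Equations (18) and (19), i.e., taking the minimum of two target critics and smoothing the target action with clipped Gaussian noise, might be computed as in the PyTorch sketch below; the network sizes, action bound, and noise parameters are assumptions for illustration.

```python
import torch
import torch.nn as nn

state_dim, action_dim, max_action = 4, 1, 10.0   # assumed dimensions and action bound

def q_net():
    return nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))

critic1_target, critic2_target = q_net(), q_net()            # the two target critic networks
actor_target = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                             nn.Linear(64, action_dim), nn.Tanh())

def clipped_double_q_target(r, s_next, gamma=0.995, sigma=0.2, c=0.5):
    """r: [B, 1], s_next: [B, state_dim]; returns the target y of Equations (18)-(19)."""
    with torch.no_grad():
        a_next = actor_target(s_next) * max_action
        noise = torch.clamp(torch.randn_like(a_next) * sigma, -c, c)   # eps ~ clip(N(0, sigma), -c, c)
        a_next = torch.clamp(a_next + noise, -max_action, max_action)
        sa = torch.cat([s_next, a_next], dim=1)
        q_min = torch.min(critic1_target(sa), critic2_target(sa))      # minimum over the two target critics
        return r + gamma * q_min
```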
Based on the above improvements, the flowchart of the improved DDPG algorithm is shown in Figure 5.

4. Experimental Results and Analysis

In this section, we use a first-order inverted pendulum as the controlled object to validate the DDPG algorithm, the improved DDPG algorithm, and the traditional PID method, and we then analyze and compare their control effects. The inverted pendulum is a universal and classic physical model in traditional control systems. As a controlled object, it is also a relatively complex system that is high-order, nonlinear, unstable, multivariable, and strongly coupled, and it is widely used for verifying stability control problems.

4.1. Simulation Environment

The simulations were conducted mainly in the Matlab/Simulink environment, using the Simscape Multibody Toolbox and the Reinforcement Learning Toolbox for experimental simulation. The Simscape Multibody Toolbox provides a multi-body simulation environment for mechanical systems; modules such as solids, joints, constraints, and force elements can be used to model multi-body systems and solve the equations of motion of the complete mechanical system. At the same time, 3D animations are generated to visualize the motion. In this paper, the slider cart of the inverted pendulum is connected to the pendulum rod through a joint structure and coordinate transformations to form a complete first-order inverted pendulum model. The key parameters are shown in Table 1.
The Reinforcement Learning Toolbox provides the agent for the DDPG algorithm and connects the entire network. The agent module is connected to the corresponding environment, and the reward function is set. The reward r_t is defined by Equation (20):
r_t = -0.1 \left( 5 \theta_t^2 + x_t^2 + 0.05 u_{t-1}^2 \right) - 100 B    (20)
where θ_t is the angular displacement of the pole from the upright position, x_t is the displacement of the cart from the center position, u_{t-1} is the control effort from the previous time step, and B is a flag (1 or 0) indicating whether the cart is out of bounds. The coefficients are applied for normalization, and the stopping value is −400. After training, the agent module can act as a controller that provides the input to the system and adjusts it according to the feedback. The experimental environment is shown in Figure 6.
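In code form, the reward of Equation (20) can be written as the small function below; the negative signs follow the stated penalty interpretation and the −400 stopping value, and the constants are those given in the text.

```python
def pendulum_reward(theta_t, x_t, u_prev, out_of_bounds):
    """Equation (20): quadratic penalties on pole angle, cart displacement, and
    previous control effort, plus a large penalty when the cart leaves the rail."""
    B = 1.0 if out_of_bounds else 0.0
    return -0.1 * (5.0 * theta_t**2 + x_t**2 + 0.05 * u_prev**2) - 100.0 * B

# Example: small deviations with the cart still on the rail
print(pendulum_reward(theta_t=0.05, x_t=0.1, u_prev=1.0, out_of_bounds=False))
```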
Following mature engineering tuning practice, the PID scheme uses a displacement controller with parameters Kp = 16, Ki = 5, and Kd = 3 and an angular displacement controller with parameters Kp = 125, Ki = 20, and Kd = 6, as sketched below. The hyperparameter settings used in DDPG training are shown in Table 2.
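For reference, a discrete-time sketch of the two PID loops with the gains listed above is given here; the sampling time and the way the two control signals are combined into one force command are assumptions, since the text does not specify them.

```python
class PID:
    """Textbook discrete PID; dt is an assumed sampling time."""
    def __init__(self, kp, ki, kd, dt=0.01):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_err = 0.0, 0.0

    def step(self, err):
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

pid_x = PID(kp=16, ki=5, kd=3)        # displacement loop (gains from the text)
pid_theta = PID(kp=125, ki=20, kd=6)  # angular displacement loop (gains from the text)

def control(x_err, theta_err):
    """One assumed way to combine the two loops into a single force command."""
    return pid_x.step(x_err) + pid_theta.step(theta_err)
```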
Before the simulation, note that the initial state of the inverted pendulum has a large impact on the control results of this nonlinear system. Considering the physical meaning of an inverted pendulum, the input must be limited, and the range of motion is set to [−3.5, 3.5] m. The starting position of the DDPG controller is the center of the guide rail, with the pendulum rod hanging naturally downward. The simulation takes the pendulum rod from this naturally hanging state to the vertical upward state and maintains balance within the limited range of motion. The PID-controlled pendulum starts in the vertical upward state and maintains balance throughout the simulation.

4.2. Results and Analysis

After the above hyperparameters and the corresponding neural networks were configured, the reinforcement learning training parameters were set as shown in Table 2, and training was started. The training results are shown in Figure 7 and Figure 8.
In the training figures, the reward of each training episode is shown. Episode Q0 is the estimate of the discounted long-term reward at the start of each episode and serves as a convergence index; it is discounted by the discount factor over each episode during training. The episode reward is computed from the reward function according to the performance of each training episode. The average reward is tracked so that the agent does not converge prematurely because of a few outstanding rewards; it is calculated as the current total reward divided by the number of training episodes.
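As a small bookkeeping illustration of the averaged curve described above, the snippet below maintains the running mean of episode rewards; it is a generic sketch, not the toolbox's internal implementation.

```python
episode_rewards = []

def record_episode(episode_reward):
    """Store the latest episode reward and return the running average
    (total reward so far divided by the number of completed episodes)."""
    episode_rewards.append(episode_reward)
    return sum(episode_rewards) / len(episode_rewards)

# Hypothetical rewards from three training episodes
for r in (-380.0, -220.0, -150.0):
    print(record_episode(r))
```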
From the training figures, it can be seen that DDPG requires 235 training episodes to complete training, while the improved DDPG requires only 145. This is because, after the experience pool size, the settlement method, and the Critic network structure are improved, the Q value at each step is kept at a reasonable level, which reduces overestimation and suppresses local optima and suboptimal behavior during training. This allows the Actor network to absorb positive experience quickly during experience accumulation and climbing, complete the accumulation in fewer steps, and thus improve the overall training efficiency.
The trained DDPG network, the improved DDPG network, and the traditional PID controller are tested on the same controlled target. At the same time, white noise with a power of 0.1 W and a seed of 23,341 is added to the simulation. The blue, red, yellow, and purple curves represent displacement, angular displacement, velocity, and angular velocity, respectively. The white noise experimental results are shown in Figure 9, Figure 10 and Figure 11.
Subsequently, a square wave signal with a period of 2 s, a duty cycle of 20%, an amplitude of 1, and a delay of 5 s was added as the interference force in the simulation. The interference experimental results are shown in Figure 12, Figure 13 and Figure 14.
From the figures of the experimental results, it can be seen that the first-order inverted pendulum controlled by the DDPG algorithms completes swing-up at around 3 s and, 2 s later (at 5 s), withstands the interference and maintains balance until the end of the 10 s experiment. The first 3 s of the simulation results therefore reflect the large fluctuations of the system's parameters during swing-up. The first-order inverted pendulum under PID control starts in the vertical upward state, fluctuates significantly at the beginning of the simulation, and gradually stabilizes at around 3 s. Similarly, after 5 s it withstands the interference and maintains balance until the end of the 10 s experiment.
The overall simulation results show that the DDPG controller and the improved DDPG controller have slightly inferior control performance in displacement and velocity compared with the PID controller, but both remain within a reasonable control range. At the same time, the DDPG controllers have a smaller fluctuation range in angular displacement and angular velocity control; compared with the PID controller, they show an order-of-magnitude advantage and exhibit stronger anti-interference and recovery stability under white noise and sudden interference forces. From a practical perspective, slight fluctuations in the angular displacement of an inverted pendulum can cause significant fluctuations in displacement, and poor control of the angle can lead to the collapse of the overall system balance. The reward function used for the DDPG controllers in this paper weights angular displacement five times more heavily than displacement, so the agent prioritizes angular displacement during training and, in turn, sacrifices some of the displacement's anti-interference ability to improve that of the angular displacement. Compared with the more balanced and versatile PID controller, the DDPG controllers may therefore exhibit this behavior.
Comparing the DDPG controller with the improved DDPG controller, the balancing strategy of the DDPG controller is too narrow: it only completes the balancing task within the corresponding time and neglects to offset the lateral velocity. As a result, the inverted pendulum keeps drifting laterally on the slide rail and would exceed the given range of motion during long-term operation. The improved DDPG controller suppresses overestimated Q values, allowing it to explore the action space more effectively; after each fluctuation it shifts back appropriately and, compared with the DDPG controller, effectively reduces the range of displacement fluctuations. In the relatively balanced phase after swing-up, under white noise the maximum velocity, angular displacement, and angular velocity fluctuation ranges of the DDPG controller are 0.9027 m/s, 0.0892 rad, and 0.8386 rad/s, respectively, while those of the improved DDPG controller are 0.7092 m/s, 0.0759 rad, and 0.7527 rad/s. Under the interference force, the maximum velocity, angular displacement, and angular velocity fluctuation ranges of the DDPG controller are 0.3821 m/s, 0.04243 rad, and 0.2410 rad/s, respectively, while those of the improved DDPG controller are 0.3595 m/s, 0.02551 rad, and 0.2119 rad/s. In the longitudinal comparison, because white noise has a high frequency and large amplitude compared with the interference force, which has low-frequency mechanical characteristics, it has a greater impact on system stability, so the parameter fluctuations under white noise are larger than those under the interference force. The improved DDPG controller has a small fluctuation range in velocity and especially in angular displacement and angular velocity, and it has strong anti-interference ability; it stays within the motion control range over the entire waveform, except for the overshoot that is inevitably produced during swing-up. Under the interference force, the DDPG controller regains stability 1.725 s after being disturbed, while the improved DDPG controller regains stability after 1.388 s. In the horizontal comparison, the improved DDPG controller responds faster, and when the controlled system is suddenly disturbed, it is affected less and therefore recovers stability faster. The four evaluation indicators of displacement, velocity, angular displacement, and angular velocity all remain within a reasonable and controllable range.

5. Conclusions

This paper analyzes the control principle of the DDPG algorithm: starting from the Markov decision process, it introduces the Actor–Critic algorithm and combines it with the DQN algorithm to derive the gradient-update process of DDPG. It analyzes and organizes the recursive small-experience-pool DDPG algorithm in depth. An improved method is proposed to address Q-value overestimation in the DDPG algorithm, and its feasibility is verified. Finally, a first-order inverted pendulum model was built with Simscape Multibody in a simulation environment, and the improved DDPG algorithm was simulated in Matlab/Simulink and compared with the DDPG algorithm and a traditional PID controller.
By configuring the corresponding neural networks and evaluation indicators and by improving the experience pool structure and settlement method, the learning and training of the DDPG and improved DDPG controllers for the first-order inverted pendulum are completed. The simulation results show that under the DDPG and PID controllers, the displacement and angular displacement of the first-order inverted pendulum remain within the control range. The comparison shows that the DDPG controllers have better response speed and stronger stability and convergence than the PID controller, as well as excellent anti-interference ability and stability recovery against sudden disturbances. With the changed structure, the Critic network selects the minimum Q value, which effectively solves the Q-value overestimation problem, draws effective experience within the same training time, trains efficiently, converges earlier under the same convergence conditions, and avoids local optima. From the analysis of the training figures and experimental results, the improved DDPG algorithm has better stability and convergence and stronger anti-interference ability and stability recovery than DDPG.
This paper demonstrates the effectiveness of control with the reinforcement learning DDPG algorithm and improves it to address the issues that arise. The improved DDPG has good autonomous optimization ability and is trained by trial and error to achieve the control objectives better and faster. This provides a reference for applying reinforcement learning to classical control problems and to traditional control systems. Much work remains in addressing classical control problems through reinforcement learning. First, adjusting the index weights in the reward function and the training hyperparameters would help increase trial-and-error efficiency. In addition, introducing data-driven prior experience into the agent would also be helpful. Finally, deploying the trained agent on an industrial controller would address the "reality gap" problem. These problems are all worthy of in-depth research.

Author Contributions

Conceptualization, H.H. and Y.C.; methodology, H.H.; software, T.W.; resources, T.W.; data curation, W.C.; writing—original draft preparation, Y.C.; writing—review and editing, H.H.; visualization, F.F.; supervision, F.F.; project administration, H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (52262050), the Science and Technology Project of Ganjiang Innovation Research Institute, Chinese Academy of Sciences (No. E255J001), and the Open Project of Jiangxi Province Key Laboratory of Maglev Technology (No. 204205100004).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data generated or analyzed during this study are included in this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chang, H.; Li, X.; Zhong, W. Based on Adaptive Fuzzy PID of Two-stage Inverted Pendulum Control Method. Fire Control Command Control 2022, 47, 108–113.
  2. Liu, C.; Tao, Y.; Guo, S.; Chen, Y. Adaptive Integral Backstepping Control Strategy for Inverted Pendulum. Appl. Res. Comput. 2020, 37, 452–455.
  3. Liu, Q.; Zhai, J.; Zhang, Z.; Zhong, S.; Zhou, Q.; Zhang, P.; Xu, J. A Survey on Deep Reinforcement Learning. Chin. J. Comput. 2018, 41, 1–27.
  4. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. Comput. Sci. 2013, 12, 1–9.
  5. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), Beijing, China, 21–26 June 2014.
  6. Li, Y.; Fang, Y.; Akhtar, Z. Accelerating Deep Reinforcement Learning Model for Game Strategy. Neurocomputing 2020, 408, 157–168.
  7. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. In Proceedings of the 4th International Conference on Learning Representations (ICLR 2016), San Juan, Puerto Rico, 2–4 May 2016.
  8. Zhang, J.; Liu, Q. Deep Deterministic Policy Gradient with Episode Experience Replay. Comput. Sci. 2021, 48, 37–43.
  9. Zhao, J.; Su, C.; Wang, Y. Research on Microservice Coordination Technologies Based on Deep Reinforcement Learning. In Proceedings of the 2022 2nd International Conference on Electronic Information Technology and Smart Agriculture (ICEITSA), Huaihua, China, 9–11 December 2022.
  10. Wang, Y.; Ren, T.; Fan, Z. Autonomous Maneuver Decision of UAV Based on Deep Reinforcement Learning: Comparison of DQN and DDPG. In Proceedings of the 2022 34th Chinese Control and Decision Conference (CCDC), Hefei, China, 15–17 August 2022.
  11. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-Level Control through Deep Reinforcement Learning. Nature 2015, 518, 529–533.
  12. van Hasselt, H.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI 2016), Phoenix, AZ, USA, 12–17 February 2016.
  13. Wang, Z.; Schaul, T.; Hessel, M.; van Hasselt, H.; Lanctot, M.; de Freitas, N. Dueling Network Architectures for Deep Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016.
  14. Liu, C.; Gao, J.; Bi, Y.; Shi, X.; Tian, D. A Multitasking-Oriented Robot Arm Motion Planning Scheme Based on Deep Reinforcement Learning and Twin Synchro-Control. Sensors 2020, 20, 3515.
  15. Gu, S.; Holly, E.; Lillicrap, T.; Levine, S. Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017.
  16. El Sallab, A.; Abdou, M.; Perot, E.; Yogamani, S. Deep Reinforcement Learning Framework for Autonomous Driving. Statistics 2017, 29, art00012.
  17. Liu, S.; Lin, Q.; Yang, Z.; Wu, Y.; Zhai, Y. Balance Control of Two-Wheeled Robot Based on Deep Deterministic Policy Gradient. Mech. Eng. 2020, 345, 142–144.
  18. Liu, Y.; Li, X.; Jiang, P.; Sun, B.; Wu, Z.; Jiang, X.; Qian, S. Research on Robot Dynamic Target Tracking and Obstacle Avoidance Control Based on DDPG–PID. J. Nanjing Univ. Aeronaut. Astronaut. 2022, 54, 41–50.
  19. Wang, Y.; Chen, S.; Huang, H. Inverted Pendulum Controller Based on Improved Deep Reinforcement Learning. Control Eng. China 2022, 29, 2018–2026.
  20. Xue, H.; Zhe, F.; Fang, Q.; Liu, X. Reinforcement Learning Based Fractional Gradient Descent RBF Neural Network Control of Inverted Pendulum. Control Decis. 2021, 36, 125–134.
  21. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Harley, T.; Lillicrap, T.P.; Silver, D.; Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016.
  22. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; The MIT Press: Cambridge, MA, USA, 2018; pp. 13–15.
  23. Zhang, S.; Sutton, R.S. A Deeper Look at Experience Replay. arXiv 2017, arXiv:1712.01275.
  24. van Hasselt, H.; Mahmood, A.R.; Sutton, R.S. Off-Policy TD(λ) with a True Online Equivalence. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI 2014), Quebec City, QC, Canada, 23–27 July 2014.
  25. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 10–15 July 2018.
Figure 1. DDPG flowchart.
Figure 2. Markov decision process.
Figure 3. DQN algorithm flowchart.
Figure 4. Actor–Critic algorithm flowchart.
Figure 5. Improved DDPG flowchart.
Figure 6. Experimental environment.
Figure 7. DDPG training figure.
Figure 8. Improved DDPG training figure.
Figure 9. DDPG white noise experimental results.
Figure 10. Improved DDPG white noise experimental results.
Figure 11. PID white noise experimental results.
Figure 12. DDPG interference experimental results.
Figure 13. Improved DDPG interference experimental results.
Figure 14. PID interference experimental results.
Table 1. Inverted pendulum parameters.

Parameter                                   Value
Pendulum mass (kg)                          0.2
Cart mass (kg)                              0.46
Length from centroid to hinge point (m)     0.25
Frictional coefficient                      0.08
Gravitational acceleration (m/s^2)          9.81
Table 2. DDPG training hyperparameters.

Hyperparameter                   Value
Critic network learning rate     0.001
Actor network learning rate      0.0005
Critic network structure         (5 × 128 × 200 + 1 × 200) × 1
Actor network structure          5 × 128 × 200 × 1
Discount factor                  0.995
Target update frequency          10
Update interval                  100
Soft update parameter            0.001
Activation function              ReLU