Temporal Consistency-Based Loss Function for Both Deep Q-Networks and Deep Deterministic Policy Gradients for Continuous Actions
Abstract
1. Introduction
Pseudo-Code 1. DRL-based AI agent using both DQN and DDPG
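As a concrete illustration of Pseudo-Code 1, the following Python sketch shows an agent loop that selects DQN for discrete action spaces and DDPG for continuous ones. It assumes a gym-style environment and two hypothetical agent classes, DQNAgent and DDPGAgent, exposing act(), remember(), and train(); it is an illustrative reconstruction, not the code used in the experiments.

```python
# Illustrative sketch: one training loop that dispatches to DQN (discrete
# actions) or DDPG (continuous actions). The `agents` module is hypothetical.
import gym
from agents import DQNAgent, DDPGAgent  # hypothetical implementations

def run(env_name, episodes=500):
    env = gym.make(env_name)
    discrete = isinstance(env.action_space, gym.spaces.Discrete)
    agent = DQNAgent(env) if discrete else DDPGAgent(env)

    for episode in range(episodes):
        state = env.reset()
        done, total_reward = False, 0.0
        while not done:
            action = agent.act(state)            # eps-greedy (DQN) or mu(s) + noise (DDPG)
            next_state, reward, done, _ = env.step(action)
            agent.remember(state, action, reward, next_state, done)
            agent.train()                        # one mini-batch update from replay memory
            state, total_reward = next_state, total_reward + reward
        print(f"episode {episode}: return {total_reward:.1f}")
```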
2. Notation and Background
2.1. Markov Decision Process (MDP)
Algorithm 1. Standard TD-learning
Initialize θ_0 randomly and set θ'_0 = θ_0
For iteration k = 0, 1, … do
  Sample s ∼ d(·) and a ∼ π(s, ·)
  Sample s' and r(s, a)
  Let g_k = −ϕ(s)(r(s, a) + γϕ(s')ᵀθ'_k − ϕ(s)ᵀθ_k)
  Update θ_{k+1} = θ_k − α_k g_k
  Update θ'_{k+1} = θ_{k+1}
End for
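The following NumPy sketch renders Algorithm 1 directly; the callables phi, sample_state, sample_action, and sample_transition are placeholders for the feature map ϕ(·) and the sampling steps s ∼ d(·), a ∼ π(s, ·), (s', r(s, a)).

```python
import numpy as np

def td_learning(phi, sample_state, sample_action, sample_transition,
                dim, gamma=0.99, alpha=0.01, iterations=10000):
    """Linear TD(0) with a target parameter vector, following Algorithm 1."""
    theta = np.random.randn(dim)        # theta_0, initialized randomly
    theta_target = theta.copy()         # theta'_0 = theta_0
    for k in range(iterations):
        s = sample_state()              # s ~ d(.)
        a = sample_action(s)            # a ~ pi(s, .)
        s_next, r = sample_transition(s, a)
        # TD error uses the target parameters for the bootstrap term
        td_error = r + gamma * phi(s_next) @ theta_target - phi(s) @ theta
        g = -phi(s) * td_error          # g_k = -phi(s) * delta_k
        theta = theta - alpha * g       # theta_{k+1} = theta_k - alpha_k g_k
        theta_target = theta.copy()     # theta'_{k+1} = theta_{k+1}
    return theta
```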
2.2. Deep Q-Network (DQN)
Algorithm 2. DQN with experience replay
Initialize replay memory D
Initialize the action-value function Q with weights θ^Q
Initialize the target network Q⁻ with weights θ^Q⁻ = θ^Q
For episode = 1, M do
  Initialize sequence s_1 = {x_1} and pre-processed sequence ϕ_1 = ϕ(s_1)
  For t = 1, T do
    With probability ϵ select a random action a_t
    Otherwise select a_t = argmax_a Q(ϕ(s_t), a; θ^Q)
    Execute action a_t, observe reward r_t and new state s_{t+1}, and pre-process ϕ_{t+1} = ϕ(s_{t+1})
    Store transition (ϕ_t, a_t, r_t, ϕ_{t+1}) in D
    Sample a random mini-batch of transitions (ϕ_i, a_i, r_i, ϕ_{i+1}) from D
    Set y_i = r_i if step i + 1 is terminal
        y_i = r_i + γ max_{a'} Q⁻(ϕ_{i+1}, a'; θ^Q⁻) otherwise
    Perform a gradient descent step on (y_i − Q(ϕ_i, a_i; θ^Q))² with respect to the network parameters θ^Q
    Every C steps reset Q⁻ = Q
  End for
End for
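As a concrete rendering of the inner loop of Algorithm 2, the sketch below implements the ϵ-greedy action selection, the target computation y_i, and the gradient step with Keras; the network sizes, optimizer, and hyper-parameter values are illustrative assumptions rather than the settings used in Section 4.

```python
import random
from collections import deque

import numpy as np
from tensorflow import keras

class DQNAgent:
    def __init__(self, state_dim, n_actions, gamma=0.99, lr=1e-3,
                 epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995):
        self.n_actions, self.gamma = n_actions, gamma
        self.epsilon, self.epsilon_min, self.epsilon_decay = epsilon, epsilon_min, epsilon_decay
        self.memory = deque(maxlen=50000)                   # replay memory D
        self.q = self._build(state_dim, n_actions, lr)      # Q(.; theta_Q)
        self.q_target = self._build(state_dim, n_actions, lr)
        self.q_target.set_weights(self.q.get_weights())     # theta_Q- = theta_Q

    @staticmethod
    def _build(state_dim, n_actions, lr):
        model = keras.Sequential([
            keras.layers.Dense(24, activation="relu", input_shape=(state_dim,)),
            keras.layers.Dense(24, activation="relu"),
            keras.layers.Dense(n_actions, activation="linear"),
        ])
        model.compile(optimizer=keras.optimizers.Adam(lr), loss="mse")
        return model

    def remember(self, s, a, r, s_next, done):
        self.memory.append((s, a, r, s_next, done))          # store transition in D

    def act(self, state):
        if np.random.rand() < self.epsilon:                  # explore with probability epsilon
            return random.randrange(self.n_actions)
        return int(np.argmax(self.q.predict(state[None], verbose=0)[0]))

    def train(self, batch_size=64):
        if len(self.memory) < batch_size:
            return
        batch = random.sample(self.memory, batch_size)       # random mini-batch from D
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        y = self.q.predict(states, verbose=0)
        q_next = self.q_target.predict(next_states, verbose=0)
        # y_i = r_i for terminal steps, r_i + gamma * max_a' Q-(s_{i+1}, a') otherwise
        y[np.arange(batch_size), actions] = rewards + self.gamma * np.max(q_next, axis=1) * (1 - dones)
        self.q.fit(states, y, epochs=1, verbose=0)            # gradient step on (y_i - Q(s_i, a_i))^2
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

    def update_target(self):
        self.q_target.set_weights(self.q.get_weights())       # every C steps reset Q- = Q
```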
2.3. Deep Deterministic Policy Gradient (DDPG)
Algorithm 3. DDPG
Initialize replay memory D
Initialize the critic network Q with weights θ^Q
Initialize the target network Q⁻ with weights θ^Q⁻ = θ^Q
Initialize the actor network μ with weights θ^μ
Initialize the target network μ⁻ with weights θ^μ⁻ = θ^μ
For episode = 1, M do
  Initialize a random process N for action exploration
  Initialize observation state s_1
  For t = 1, T do
    Select action a_t = μ(s_t | θ^μ) + N_t according to the current policy and exploration noise
    Execute action a_t and observe reward r_t and new state s_{t+1}
    Store transition (s_t, a_t, r_t, s_{t+1}) in D
    Sample a random mini-batch of transitions (s_i, a_i, r_i, s_{i+1}) from D
    Set y_i = r_i + γ Q⁻(s_{i+1}, μ⁻(s_{i+1} | θ^μ⁻) | θ^Q⁻)
    Update the critic network by minimizing the loss L = Σ_i (y_i − Q(s_i, a_i | θ^Q))²
    Update the actor network using the sampled policy gradient ∇_{θ^μ} J ≈ Σ_i ∇_a Q(s_i, μ(s_i) | θ^Q) ∇_{θ^μ} μ(s_i | θ^μ)
    Update the target networks: θ^μ⁻ = τθ^μ + (1 − τ)θ^μ⁻ and θ^Q⁻ = τθ^Q + (1 − τ)θ^Q⁻
  End for
End for
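The per-mini-batch updates of Algorithm 3 can be written compactly with TensorFlow 2 gradient tapes, as in the sketch below; the two-input critic signature critic([s, a]), the (batch, 1)-shaped reward tensor, and the hyper-parameter values are assumptions about how the networks are built, not the exact implementation evaluated in this paper.

```python
import tensorflow as tf

def ddpg_update(actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.001):
    """One DDPG mini-batch update (Algorithm 3). Assumes tf.keras models
    actor(s) -> a and critic([s, a]) -> Q(s, a); rewards has shape (N, 1)."""
    states, actions, rewards, next_states = batch

    # y_i = r_i + gamma * Q-(s_{i+1}, mu-(s_{i+1} | theta_mu-) | theta_Q-)
    y = rewards + gamma * critic_target([next_states, actor_target(next_states)])

    # Critic: minimize sum_i (y_i - Q(s_i, a_i | theta_Q))^2
    with tf.GradientTape() as tape:
        q = critic([states, actions])
        critic_loss = tf.reduce_mean(tf.square(y - q))
    critic_opt.apply_gradients(zip(tape.gradient(critic_loss, critic.trainable_variables),
                                   critic.trainable_variables))

    # Actor: ascend grad_a Q(s, mu(s)) * grad_theta_mu mu(s), i.e. maximize Q(s, mu(s))
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([states, actor(states)]))
    actor_opt.apply_gradients(zip(tape.gradient(actor_loss, actor.trainable_variables),
                                  actor.trainable_variables))

    # Soft target updates: theta- = tau * theta + (1 - tau) * theta-
    for target, online in ((actor_target, actor), (critic_target, critic)):
        for t_var, var in zip(target.variables, online.variables):
            t_var.assign(tau * var + (1.0 - tau) * t_var)
```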
3. Proposed TC Loss Functions for Both DQN and DDPG
3.1. Previously Developed Loss Functions
3.2. Newly Proposed TC Loss Functions
Algorithm 4. The proposed algorithm with TC-DQN and TC-DDPG
Initialize replay memory D
Initialize Q with weights θ^Q, used as the action-value function in DQN and as the critic network in DDPG
Initialize Q⁻ with weights θ^Q⁻ = θ^Q for the target networks in both DQN and DDPG
Initialize the actor network μ with weights θ^μ for DDPG
Initialize the target network μ⁻ with weights θ^μ⁻ = θ^μ for DDPG
For episode = 1, M do
  Initialize a random process N for DDPG
  Initialize observation state s_1
  For t = 1, T do
    Derive the action: in DQN, a_t is random with probability ϵ or a_t = argmax_a Q(s_t, a; θ^Q); in DDPG, a_t = μ(s_t | θ^μ) + N_t
    Execute action a_t and observe reward r_t and new state s_{t+1}
    Store transition (s_t, a_t, r_t, s_{t+1}) in D
    Sample a random mini-batch of transitions (s_i, a_i, r_i, s_{i+1}) from D
    In DQN:
      Set y_i = r_i if step i + 1 is terminal
          y_i = r_i + γ max_{a'} Q⁻(s_{i+1}, a'; θ^Q⁻) otherwise
      Update the action-value function on (y_i − Q(s_i, a_i; θ^Q))²
      Update the target network with the additional subtraction (TC-DQN) on Σ_i (Q⁻^(k)(s_{i+1}, argmax_{a'} Q⁻(s_{i+1}, a')) − Q⁻^(k−1)(s_{i+1}, argmax_{a'} Q⁻(s_{i+1}, a')))²
    In DDPG:
      Set y_i = r_i + γ Q⁻(s_{i+1}, μ⁻(s_{i+1} | θ^μ⁻) | θ^Q⁻)
      Update the critic network on Σ_i (y_i − Q(s_i, a_i | θ^Q))²
      Update the actor network on Σ_i ∇_a Q(s_i, μ(s_i) | θ^Q) ∇_{θ^μ} μ(s_i | θ^μ)
      Update the target network for the actor on θ^μ⁻ = τθ^μ + (1 − τ)θ^μ⁻
      Update the target network for the critic with the additional subtraction (TC-DDPG) on Σ_i (Q⁻^(k)(s_{i+1}, μ⁻(s_{i+1} | θ^μ⁻) + N_{i+1}) − Q⁻^(k−1)(s_{i+1}, μ⁻(s_{i+1} | θ^μ⁻) + N_{i+1}))²
  End for
End for
Here Q⁻^(k) and Q⁻^(k−1) denote the target network at the current and the previous update, respectively.
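The sketch below illustrates only the additional subtraction terms of Algorithm 4, i.e., the temporal-consistency penalties that compare the current target network with a snapshot taken at the previous update. The arguments q_target_prev and critic_target_prev denote such snapshots, and all function and argument names are illustrative rather than taken from the paper's code.

```python
import tensorflow as tf

def tc_loss_dqn(q_target, q_target_prev, next_states):
    """TC-DQN penalty: squared change of the target network's value at the
    greedy next action between the previous and current target parameters."""
    q_next = q_target(next_states)                            # Q-^(k)(s_{i+1}, .)
    a_star = tf.argmax(q_next, axis=1, output_type=tf.int32)  # argmax_a' Q-(s_{i+1}, a')
    idx = tf.stack([tf.range(tf.shape(q_next)[0]), a_star], axis=1)
    q_k = tf.gather_nd(q_next, idx)                           # Q-^(k)(s_{i+1}, a*)
    q_km1 = tf.gather_nd(q_target_prev(next_states), idx)     # Q-^(k-1)(s_{i+1}, a*)
    return tf.reduce_mean(tf.square(q_k - q_km1))             # mean over the mini-batch

def tc_loss_ddpg(critic_target, critic_target_prev, actor_target, next_states, noise):
    """TC-DDPG penalty: the same squared difference evaluated at the target
    actor's noisy action mu-(s_{i+1}) + N_{i+1}."""
    a_next = actor_target(next_states) + noise                # mu-(s_{i+1} | theta_mu-) + N_{i+1}
    q_k = critic_target([next_states, a_next])                # Q-^(k)
    q_km1 = critic_target_prev([next_states, a_next])         # Q-^(k-1)
    return tf.reduce_mean(tf.square(q_k - q_km1))
```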
4. Evaluation and Results
4.1. “Cart-Pole”
4.2. “Pendulum”
5. Discussion
6. Conclusions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest