Integrating Risk-Averse and Constrained Reinforcement Learning for Robust Decision-Making in High-Stakes Scenarios
Abstract
1. Introduction
Research Contributions
- To prove strong duality for spectral risk-averse MDPs with expectation-based cumulative and/or deterministic instantaneous constraints, without imposing any convexity assumptions on the problem (a notational sketch of the constrained problem is given after this list).
- To propose a constraint-handling mechanism for risk-averse reinforcement learning based on the strong duality results.
- To establish, theoretically and empirically, the convergence of the proposed constraint-handling mechanism.
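To make the first contribution concrete, the following is a minimal notational sketch of a spectral risk-averse MDP with an expectation-based cumulative constraint, its Lagrangian, and the strong duality statement. The symbols (policy π, discount γ, reward r, constraint cost c, budget d, spectral risk measure ρ_σ, multiplier λ) are assumptions introduced here for illustration and may differ from the paper's own notation.

```latex
% Illustrative notation only (the paper's own symbols may differ).
\begin{aligned}
\textbf{(Problem 1)}\qquad
\max_{\pi\in\Pi}\;& \rho_{\sigma}\!\Big(\textstyle\sum_{t\ge 0}\gamma^{t} r(s_t,a_t)\Big)
\quad\text{s.t.}\quad
\mathbb{E}_{\pi}\!\Big[\textstyle\sum_{t\ge 0}\gamma^{t} c(s_t,a_t)\Big]\le d,\\[4pt]
L(\pi,\lambda) \;=\;& \rho_{\sigma}\!\Big(\textstyle\sum_{t\ge 0}\gamma^{t} r(s_t,a_t)\Big)
-\lambda\Big(\mathbb{E}_{\pi}\!\Big[\textstyle\sum_{t\ge 0}\gamma^{t} c(s_t,a_t)\Big]-d\Big),
\qquad \lambda\ge 0,\\[4pt]
\text{strong duality:}\quad
&\max_{\pi\in\Pi}\,\min_{\lambda\ge 0} L(\pi,\lambda)
\;=\;\min_{\lambda\ge 0}\,\max_{\pi\in\Pi} L(\pi,\lambda).
\end{aligned}
```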
2. Preliminaries
2.1. Action Space
2.2. State Space
2.3. Policy Space
2.4. Reward and Constraint Functions
2.5. Discount Factor
2.6. Reference Equations
3. Risk-Averse Markov Decision Processes with Cumulative Constraints
Strong Duality for “Problem 1”
- Expectation-Based Constraints
- Risk-Averse Objective
4. Risk-Averse Markov Decision Processes with Instantaneous Constraints
Strong Duality for “Problem 2”
5. Augmented Lagrangian-Based Constraint Handling Mechanism
5.1. Clipping Method
5.2. Surrogate Objective
5.3. Reward Function
5.4. Developed Mechanism
Algorithm 1: Augmented Lagrangian-based constrained risk-averse RL
1. Choose the dual ascent step size and initialize the Lagrangian multipliers and quadratic penalty coefficients
2. Initialize the RL policy
3. For each outer iteration (the number of outer iterations determines how often the Lagrangian multipliers and quadratic penalty coefficients are updated):
4.   Update the policy by optimizing the augmented Lagrangian surrogate objective (risk-averse RL step)
5.   Update the Lagrangian multipliers via dual ascent and adjust the quadratic penalty coefficients
6. End
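The loop structure of Algorithm 1 can be illustrated with a deliberately simplified, self-contained sketch: a scalar "policy" parameter, a placeholder objective standing in for the risk-averse return, and a placeholder constraint standing in for the expectation-based cumulative constraint. The names (objective, constraint, inner_policy_update, eta, mu) and the simplified penalty form are assumptions for illustration, not the paper's actual update rules.

```python
# Toy sketch of the outer augmented Lagrangian / dual ascent structure.
# The objective and constraint below are placeholders, not the paper's MDP quantities.

def objective(theta):            # stand-in for the (spectral risk-averse) return
    return -(theta - 2.0) ** 2

def constraint(theta):           # stand-in for E[cumulative cost] - budget  (<= 0 is feasible)
    return theta - 1.0

def augmented_lagrangian(theta, lam, mu):
    # Simplified augmented Lagrangian for an inequality constraint:
    # L = J(theta) - lam * g(theta) - (mu / 2) * max(0, g(theta))^2
    g = constraint(theta)
    return objective(theta) - lam * g - 0.5 * mu * max(0.0, g) ** 2

def inner_policy_update(theta, lam, mu, lr=0.01, steps=200):
    # Gradient ascent on the augmented Lagrangian (stand-in for the RL policy update, step 4).
    for _ in range(steps):
        eps = 1e-5
        grad = (augmented_lagrangian(theta + eps, lam, mu)
                - augmented_lagrangian(theta - eps, lam, mu)) / (2.0 * eps)
        theta += lr * grad
    return theta

# Outer loop mirroring Algorithm 1: initialize, then repeat primal and dual updates.
theta, lam, mu, eta = 0.0, 0.0, 1.0, 1.0   # policy parameter, multiplier, penalty, dual step size
for k in range(10):
    theta = inner_policy_update(theta, lam, mu)        # step 4: primal (policy) update
    lam = max(0.0, lam + eta * constraint(theta))      # step 5: dual ascent on the multiplier
    mu *= 1.5                                          # step 5: grow the quadratic penalty coefficient
    print(f"iter {k}: theta={theta:.3f}, lam={lam:.3f}, "
          f"violation={max(0.0, constraint(theta)):.4f}")
```

In this toy instance the unconstrained maximizer (theta = 2) violates the constraint, so the multiplier grows over the outer iterations and the iterates are driven toward the constrained optimum at theta = 1, which is exactly the behavior the dual ascent step in Algorithm 1 is meant to produce.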
5.5. Theoretical Results for Convergence
6. Numerical Example
7. Discussion
Methodological Contribution of This Research
8. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
Expectation-Based Constraints
Appendix B
References
- Wang, D.; Yang, K.; Yang, L. Risk-averse two-stage distributionally robust optimisation for logistics planning in disaster relief management. Int. J. Prod. Res. 2023, 61, 668–691. [Google Scholar] [CrossRef]
- Habib, M.S.; Maqsood, M.H.; Ahmed, N.; Tayyab, M.; Omair, M. A multi-objective robust possibilistic programming approach for sustainable disaster waste management under disruptions and uncertainties. Int. J. Disaster Risk Reduct. 2022, 75, 102967. [Google Scholar] [CrossRef]
- Habib, M.S. Robust Optimization for Post-Disaster Debris Management in Humanitarian Supply Chain: A Sustainable Recovery Approach. Ph.D. Thesis, Hanyang University, Seoul, Republic of Korea, 2018. [Google Scholar]
- Hussain, A.; Masood, T.; Munir, H.; Habib, M.S.; Farooq, M.U. Developing resilience in disaster relief operations management through lean transformation. Prod. Plan. Control 2023, 34, 1475–1496. [Google Scholar] [CrossRef]
- Gu, S.; Yang, L.; Du, Y.; Chen, G.; Walter, F.; Wang, J.; Yang, Y.; Knoll, A. A Review of Safe Reinforcement Learning: Methods, Theory and Applications. arXiv 2022, arXiv:2205.10330. [Google Scholar]
- Wang, Y.; Zhan, S.S.; Jiao, R.; Wang, Z.; Jin, W.; Yang, Z.; Wang, Z.; Huang, C.; Zhu, Q. Enforcing Hard Constraints with Soft Barriers: Safe Reinforcement Learning in Unknown Stochastic Environments. In Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research, Honolulu, HI, USA, 23–29 July 2023; pp. 36593–36604. Available online: https://proceedings.mlr.press/v202/wang23as.html (accessed on 26 May 2024).
- Yang, Q.; Simão, T.D.; Tindemans, S.H.; Spaan, M.T.J. Safety-constrained reinforcement learning with a distributional safety critic. Mach. Learn. 2023, 112, 859–887. [Google Scholar] [CrossRef]
- Yin, X.; Büyüktahtakın, İ.E. Risk-averse multi-stage stochastic programming to optimizing vaccine allocation and treatment logistics for effective epidemic response. IISE Trans. Healthc. Syst. Eng. 2022, 12, 52–74. [Google Scholar] [CrossRef]
- Morillo, J.L.; Zéphyr, L.; Pérez, J.F.; Lindsay Anderson, C.; Cadena, Á. Risk-averse stochastic dual dynamic programming approach for the operation of a hydro-dominated power system in the presence of wind uncertainty. Int. J. Electr. Power Energy Syst. 2020, 115, 105469. [Google Scholar] [CrossRef]
- Yu, G.; Liu, A.; Sun, H. Risk-averse flexible policy on ambulance allocation in humanitarian operations under uncertainty. Int. J. Prod. Res. 2021, 59, 2588–2610. [Google Scholar] [CrossRef]
- Escudero, L.F.; Garín, M.A.; Monge, J.F.; Unzueta, A. On preparedness resource allocation planning for natural disaster relief under endogenous uncertainty with time-consistent risk-averse management. Comput. Oper. Res. 2018, 98, 84–102. [Google Scholar] [CrossRef]
- Coache, A.; Jaimungal, S.; Cartea, Á. Conditionally Elicitable Dynamic Risk Measures for Deep Reinforcement Learning. SSRN Electron. J. 2023, 14, 1249–1289. [Google Scholar] [CrossRef]
- Zhuang, X.; Zhang, Y.; Han, L.; Jiang, J.; Hu, L.; Wu, S. Two-stage stochastic programming with robust constraints for the logistics network post-disruption response strategy optimization. Front. Eng. Manag. 2023, 10, 67–81. [Google Scholar] [CrossRef]
- Habib, M.S.; Sarkar, B. A multi-objective approach to sustainable disaster waste management. In Proceedings of the International Conference on Industrial Engineering and Operations Management, Paris, France, 26–27 July 2018; pp. 1072–1083. [Google Scholar]
- Shapiro, A.; Tekaya, W.; da Costa, J.P.; Soares, M.P. Risk neutral and risk averse Stochastic Dual Dynamic Programming method. Eur. J. Oper. Res. 2013, 224, 375–391. [Google Scholar] [CrossRef]
- Yu, L.; Zhang, C.; Jiang, J.; Yang, H.; Shang, H. Reinforcement learning approach for resource allocation in humanitarian logistics. Expert Syst. Appl. 2021, 173, 114663. [Google Scholar] [CrossRef]
- Ahmadi, M.; Rosolia, U.; Ingham, M.; Murray, R.; Ames, A. Constrained Risk-Averse Markov Decision Processes. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [Google Scholar]
- Lockwood, P.L.; Klein-Flügge, M.C. Computational modelling of social cognition and behaviour—A reinforcement learning primer. Soc. Cogn. Affect. Neurosci. 2020, 16, 761–771. [Google Scholar] [CrossRef] [PubMed]
- Collins, A.G.E. Reinforcement learning: Bringing together computation and cognition. Curr. Opin. Behav. Sci. 2019, 29, 63–68. [Google Scholar] [CrossRef]
- Zabihi, Z.; Moghadam, A.M.E.; Rezvani, M.H. Reinforcement Learning Methods for Computing Offloading: A Systematic Review. ACM Comput. Surv. 2023, 56, 17. [Google Scholar] [CrossRef]
- Liu, P.; Zhang, Y.; Bao, F.; Yao, X.; Zhang, C. Multi-type data fusion framework based on deep reinforcement learning for algorithmic trading. Appl. Intell. 2023, 53, 1683–1706. [Google Scholar] [CrossRef]
- Shavandi, A.; Khedmati, M. A multi-agent deep reinforcement learning framework for algorithmic trading in financial markets. Expert Syst. Appl. 2022, 208, 118124. [Google Scholar] [CrossRef]
- Basso, R.; Kulcsár, B.; Sanchez-Diaz, I.; Qu, X. Dynamic stochastic electric vehicle routing with safe reinforcement learning. Transp. Res. Part E Logist. Transp. Rev. 2022, 157, 102496. [Google Scholar] [CrossRef]
- Lee, J.; Lee, K.; Moon, I. A reinforcement learning approach for multi-fleet aircraft recovery under airline disruption. Appl. Soft Comput. 2022, 129, 109556. [Google Scholar] [CrossRef]
- Shi, T.; Xu, C.; Dong, W.; Zhou, H.; Bokhari, A.; Klemeš, J.J.; Han, N. Research on energy management of hydrogen electric coupling system based on deep reinforcement learning. Energy 2023, 282, 128174. [Google Scholar] [CrossRef]
- Venkatasatish, R.; Dhanamjayulu, C. Reinforcement learning based energy management systems and hydrogen refuelling stations for fuel cell electric vehicles: An overview. Int. J. Hydrogen Energy 2022, 47, 27646–27670. [Google Scholar] [CrossRef]
- Demizu, T.; Fukazawa, Y.; Morita, H. Inventory management of new products in retailers using model-based deep reinforcement learning. Expert Syst. Appl. 2023, 229, 120256. [Google Scholar] [CrossRef]
- Wang, K.; Long, C.; Ong, D.J.; Zhang, J.; Yuan, X.M. Single-Site Perishable Inventory Management Under Uncertainties: A Deep Reinforcement Learning Approach. IEEE Trans. Knowl. Data Eng. 2023, 35, 10807–10813. [Google Scholar] [CrossRef]
- Waubert de Puiseau, C.; Meyes, R.; Meisen, T. On reliability of reinforcement learning based production scheduling systems: A comparative survey. J. Intell. Manuf. 2022, 33, 911–927. [Google Scholar] [CrossRef]
- Hildebrandt, F.D.; Thomas, B.W.; Ulmer, M.W. Opportunities for reinforcement learning in stochastic dynamic vehicle routing. Comput. Oper. Res. 2023, 150, 106071. [Google Scholar] [CrossRef]
- Dalal, G.; Dvijotham, K.; Vecerík, M.; Hester, T.; Paduraru, C.; Tassa, Y.J.A. Safe Exploration in Continuous Action Spaces. arXiv 2018, arXiv:1801.08757. [Google Scholar]
- Altman, E. Constrained Markov Decision Processes; Routledge: London, UK, 1999. [Google Scholar]
- Borkar, V.S. An actor-critic algorithm for constrained Markov decision processes. Syst. Control Lett. 2005, 54, 207–213. [Google Scholar] [CrossRef]
- Paternain, S.; Chamon, L.F.O.; Calvo-Fullana, M.; Ribeiro, A. Constrained reinforcement learning has zero duality gap. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates Inc.: New York, NY, USA, 2019; p. 679. [Google Scholar]
- Chow, Y.; Ghavamzadeh, M.; Janson, L.; Pavone, M. Risk-constrained reinforcement learning with percentile risk criteria. J. Mach. Learn. Res. 2017, 18, 6070–6120. [Google Scholar]
- Chow, Y.; Nachum, O.; Duenez-Guzman, E.; Ghavamzadeh, M. A lyapunov-based approach to safe reinforcement learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 2–8 December 2018. [Google Scholar]
- Chen, X.; Karimi, B.; Zhao, W.; Li, P. On the Convergence of Decentralized Adaptive Gradient Methods. arXiv 2021, arXiv:2109.03194. Available online: https://ui.adsabs.harvard.edu/abs/2021arXiv210903194C (accessed on 26 May 2024).
- Rao, J.J.; Ravulapati, K.K.; Das, T.K. A simulation-based approach to study stochastic inventory-planning games. Int. J. Syst. Sci. 2003, 34, 717–730. [Google Scholar] [CrossRef]
- Dinh Thai, H.; Nguyen Van, H.; Diep, N.N.; Ekram, H.; Dusit, N. Markov Decision Process and Reinforcement Learning. In Deep Reinforcement Learning for Wireless Communications and Networking: Theory, Applications and Implementation; Wiley-IEEE Press: Hoboken, NJ, USA, 2023; pp. 25–36. [Google Scholar]
- Bakker, H.; Dunke, F.; Nickel, S. A structuring review on multi-stage optimization under uncertainty: Aligning concepts from theory and practice. Omega 2020, 96, 102080. [Google Scholar] [CrossRef]
- Liu, K.; Yang, L.; Zhao, Y.; Zhang, Z.-H. Multi-period stochastic programming for relief delivery considering evolving transportation network and temporary facility relocation/closure. Transp. Res. Part E Logist. Transp. Rev. 2023, 180, 103357. [Google Scholar] [CrossRef]
- Kamyabniya, A.; Sauré, A.; Salman, F.S.; Bénichou, N.; Patrick, J. Optimization models for disaster response operations: A literature review. OR Spectr. 2024, 46, 1–47. [Google Scholar] [CrossRef]
- Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 1997. (In English) [Google Scholar]
- Dowd, K.; Cotter, J. Spectral Risk Measures and the Choice of Risk Aversion Function. arXiv 2011, arXiv:1103.5668. [Google Scholar]
- Borkar, V.S. A convex analytic approach to Markov decision processes. Probab. Theory Relat. Fields 1988, 78, 583–602. [Google Scholar] [CrossRef]
- Nguyen, N.D.; Nguyen, T.T.; Vamplew, P.; Dazeley, R.; Nahavandi, S. A Prioritized objective actor-critic method for deep reinforcement learning. Neural Comput. Appl. 2021, 33, 10335–10349. [Google Scholar] [CrossRef]
- Li, J.; Fridovich-Keil, D.; Sojoudi, S.; Tomlin, C.J. Augmented Lagrangian Method for Instantaneously Constrained Reinforcement Learning Problems. In Proceedings of the 2021 60th IEEE Conference on Decision and Control (CDC), Austin, TX, USA, 14–17 December 2021; pp. 2982–2989. [Google Scholar]
- Boland, N.; Christiansen, J.; Dandurand, B.; Eberhard, A.; Oliveira, F. A parallelizable augmented Lagrangian method applied to large-scale non-convex-constrained optimization problems. Math. Program. 2019, 175, 503–536. [Google Scholar] [CrossRef]
- Yu, L.; Yang, H.; Miao, L.; Zhang, C. Rollout algorithms for resource allocation in humanitarian logistics. IISE Trans. 2019, 51, 887–909. [Google Scholar] [CrossRef]
- Rodríguez-Espíndola, O. Two-stage stochastic formulation for relief operations with multiple agencies in simultaneous disasters. OR Spectr. 2023, 45, 477–523. [Google Scholar] [CrossRef]
- Zhang, L.; Shen, L.; Yang, L.; Chen, S.; Wang, X.; Yuan, B.; Tao, D. Penalized Proximal Policy Optimization for Safe Reinforcement Learning. arXiv 2022, arXiv:2205.11814, 3719–3725. [Google Scholar]
- Ding, S.; Wang, J.; Du, Y.; Shi, Y. Reduced Policy Optimization for Continuous Control with Hard Constraints. arXiv 2023, arXiv:2310.09574. [Google Scholar]
- Wang, Z.; Shi, X.; Ma, C.; Wu, L.; Wu, J. CCPO: Conservatively Constrained Policy Optimization Using State Augmentation; IOS Press: Amsterdam, The Netherlands, 2023. [Google Scholar]
- Peng, X.B.; Abbeel, P.; Levine, S.; Panne, M.V.D. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Trans. Graph. 2018, 37, 143. [Google Scholar] [CrossRef]
- Tamar, A.; Castro, D.D.; Mannor, S. Policy gradients with variance related risk criteria. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, UK, 26 June–1 July 2012. [Google Scholar]
- Tamar, A.; Mannor, S. Variance Adjusted Actor Critic Algorithms. arXiv 2013, arXiv:1310.3697. [Google Scholar]
- Dowson, O.; Kapelevich, L. SDDP.jl: A Julia Package for Stochastic Dual Dynamic Programming. INFORMS J. Comput. 2021, 33, 27–33. [Google Scholar] [CrossRef]
- Boda, K.; Filar, J.A. Time Consistent Dynamic Risk Measures. Math. Methods Oper. Res. 2006, 63, 169–186. [Google Scholar] [CrossRef]
- Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Lille, France, 6–11 July 2015; Available online: https://proceedings.mlr.press/v37/schulman15.html (accessed on 26 May 2024).
- Gillies, A.W. Some Aspects of Analysis and Probability. Phys. Bull. 1959, 10, 65. [Google Scholar] [CrossRef]
- Van Wassenhove, L.N. Humanitarian aid logistics: Supply chain management in high gear. J. Oper. Res. Soc. 2006, 57, 475–489. [Google Scholar] [CrossRef]
- Yu, L.; Zhang, C.; Yang, H.; Miao, L. Novel methods for resource allocation in humanitarian logistics considering human suffering. Comput. Ind. Eng. 2018, 119, 1–20. [Google Scholar] [CrossRef]