ENSURING SAFETY WHILE ENHANCING PERFORMANCE: ENCOURAGING REINFORCEMENT LEARNING BY ADDRESSING CONSTRAINTS AND UNCERTAINTY

Mohsen Abdollahzadeh Aghbolagh

Abstract


Striking a balance

Keywords


safe reinforcement learning, constraint, off-policy, exploration, risk assessment.

Full text:

PDF (English)

References


[1] G. Li, Y. Yang, S. Li, X. Qu, N. Lyu, S. E. Li, Decision making of autonomous vehicles in lane change scenarios: Deep reinforcement learning approaches with risk awareness, Transportation Research Part C: Emerging Technologies 134 (2022) 103452. DOI: https://doi.org/10.1016/j.trc.2021.103452.

[2] C. Shiranthika, K.-W. Chen, C.-Y. Wang, C.-Y. Yang, B. H. Sudantha, W.-F. Li, Supervised Optimal Chemotherapy Regimen Based on Offline Reinforcement Learning, IEEE Journal of Biomedical and Health Informatics, vol. 26, no. 9, p. 4763–4772, Sept. 2022. DOI: 10.1109/JBHI.2022.3183854.

[3] M. Abdollahzadeh, Y. Dorostkar Navaei, Anomaly detection in heart disease using a density-based unsupervised approach. DOI: https://doi.org/10.1155/2022/6913043.

[4] M. Selim, A. Alanwar, S. Kousik, G. Gao, M. Pavone and K. H. Johansson. Safe Reinforcement Learning Using Black-Box Reachability Analysis. IEEE Robotics and Automation Letters, vol. 7, no. 4, p. 10665–10672, Oct. 2022. DOI: 10.1109/LRA.2022.3192205.

[5] Y. Chow, M. Ghavamzadeh, L. Janson, M. Pavone, Risk-constrained reinforcement learning with percentile risk criteria, J. Mach. Learn. Res. 18 (2017) 6070–6120. URL: https://archive.org/details/arxiv-1512.01629 (accessed: 01.04.2024).

[6] C. Gehring, D. Precup, Smart exploration in reinforcement learning using absolute temporal difference errors, in: Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, AAMAS ’13, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 2013, p. 1037–1044. URL: https://www.researchgate.net/publication/262164966_Smart_exploration_in_reinforcement_learning_using_absolute_temporal_difference_errors (accessed: 01.04.2024).

[7] S. Fujimoto, D. Meger, D. Precup, Off-policy deep reinforcement learning without exploration, in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, PMLR. 2019, p. 2052–2062.
URL: https://proceedings.mlr.press/v97/fujimoto19a.html (accessed: 01.04.2024).

[8] J. García, F. Fernández, A comprehensive survey on safe reinforcement learning, Journal of Machine Learning Research 16 (2015), p. 1437–1480. URL: https://www.semanticscholar.org/paper/A-comprehensive-survey-on-safe-reinforcement-García-Fernández/c0f2c4104ef6e36bb67022001179887e6600d24d (accessed: 01.04.2024).

[9] A. Kumar, A. Zhou, G. Tucker, S. Levine, Conservative q-learning for offline reinforcement learning, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc. 2020, p. 1179–1191.
URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/0d2b2061826a5df3221116a5085a6052-Paper.pdf (accessed: 01.04.2024).

[10] H. Xu, X. Zhan, X. Zhu, Constraints Penalized Q-learning for Safe Offline Reinforcement Learning, Proceedings of the AAAI Conference on Artificial Intelligence 36 (8) (2022), p. 8753–8760.
DOI: https://doi.org/10.1609/aaai.v36i8.20855.

[11] G. Thomas, Y. Luo, T. Ma, Safe reinforcement learning by imagining the near future, in: M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems, volume 34, Curran Associates, Inc. 2021, p. 13859–13869.
URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/73b277c11266681122132d024f53a75b-Paper.pdf. (accessed: 01.04.2024).

[12] Y. J. Ma, A. Shen, O. Bastani, D. Jayaraman, Conservative and adaptive penalty for model-based safe reinforcement learning, in: AAAI. 2022, p. 5404–5412. DOI: 10.1609/aaai.v36i5.20478.

[13] M. L. Littman, Value-function reinforcement learning in Markov games, Cognitive Systems Research 2 (2001) 55–66. DOI: https://doi.org/10.1016/S1389-0417(01)00015-8.

[14] D. Luengo, L. Martino, M. Bugallo, et al., A survey of Monte Carlo methods for parameter estimation, EURASIP J. Adv. Signal Process. 25 (2020). DOI: https://doi.org/10.1186/s13634-020-00675-6.

[15] L. Wang, Z. Tong, B. Ji, G. Wu, TDN: Temporal difference networks for efficient action recognition, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021, p. 1895–1904.
DOI: 10.1109/CVPR46437.2021.00193.

[16] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing Atari with deep reinforcement learning. DOI: https://doi.org/10.48550/arXiv.1312.5602.

[17] H. van Hasselt, A. Guez, D. Silver, Deep Reinforcement Learning with Double Q-Learning, Proceedings of the AAAI Conference on Artificial Intelligence 30 (1) (2016).
DOI: https://doi.org/10.1609/aaai.v30i1.10295.

[18] J. Clifton, E. B. Laber, Q-Learning: Theory and Applications, Annual Review of Statistics and Its Application, vol. 7, issue 1, p. 279–301, 2020.
DOI: http://dx.doi.org/10.1146/annurev-statistics-031219-041220.

[19] J. Schulman, S. Levine, P. Moritz, M. Jordan, P. Abbeel, Trust region policy optimization, in: Proceedings of the 32nd International Conference on International Conference on Machine Learning – Vol. 37, ICML’15, JMLR.org, 2015, p. 1889–1897. URL: https://proceedings.mlr.press/v37/schulman15.pdf (accessed: 01.04.2024).

[20] N. Vieillard, T. Kozuno, B. Scherrer, O. Pietquin, R. Munos, M. Geist, Leverage the average: an analysis of KL regularization in reinforcement learning, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, vol. 33, Curran Associates, Inc. 2020, p. 12163–12174. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/8e2c381d4dd04f1c55093f22c59c3a08-Paper.pdf (accessed: 01.04.2024).

[21] V. Konda, J. Tsitsiklis, Actor-critic algorithms, in: S. Solla, T. Leen, K. Müller (Eds.), Advances in Neural Information Processing Systems, volume 12, MIT Press, 1999.
URL: https://proceedings.neurips.cc/paper_files/paper/1999/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf (accessed: 01.04.2024).

[22] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, in: J. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, vol. 80 of Proceedings of Machine Learning Research, PMLR. 2018, p. 1861–1870.
URL: https://proceedings.mlr.press/v80/haarnoja18b.html (accessed: 01.04.2024).

[23] E. Altman, Constrained Markov Decision Processes, 1st ed., Routledge, 1999. DOI: 10.1201/9781315140223.

[24] J. Taylor, Project scheduling and cost control: planning, monitoring and controlling the baseline, J. Ross Publishing, 2008. – 280 p.

[25] K. R. MacCrimmon, C. A. Ryavec, An analytical study of the PERT assumptions, Operations Research 12 (1964) 16–37.
DOI: https://doi.org/10.1287/opre.12.1.16.

[26] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, M. Riedmiller, Deterministic policy gradient algorithms, in: E. P. Xing, T. Jebara (Eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, PMLR, Beijing, China. 2014, p. 387–395.
URL: https://proceedings.mlr.press/v32/silver14.html (accessed: 01.04.2024).

[27] O. Lockwood, M. Si, A Review of Uncertainty for Deep Reinforcement Learning, Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 18 (1) (2022), p. 155–162. DOI: https://doi.org/10.1609/aiide.v18i1.21959.

[28] J. Hao et al. Exploration in Deep Reinforcement Learning: From Single-Agent to Multiagent Domain. IEEE Transactions on Neural Networks and Learning Systems.
DOI: 10.1109/TNNLS.2023.3236361.

[29] D. Pathak, P. Agrawal, A. A. Efros, T. Darrell, Curiosity-driven exploration by self-supervised prediction. 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 2017, p. 488–489.
DOI: 10.1109/CVPRW.2017.70.

[30] A. P. Badia, P. Sprechmann, A. Vitvitskyi, D. Guo, B. Piot, S. Kapturowski, O. Tieleman, M. Arjovsky, A. Pritzel, A. Bolt, C. Blundell, Never give up: Learning directed exploration strategies, in: International Conference on Learning Representations, 2020. URL: https://arxiv.org/pdf/2002.06038 (accessed: 01.04.2024).

[31] A. Ray, J. Achiam, D. Amodei, Benchmarking safe exploration in deep reinforcement learning, Preprint, OpenAI, San Francisco, CA (2019). URL: https://openai.com/index/benchmarking-safe-exploration-in-deep-reinforcement-learning (accessed: 01.04.2024).

[32] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms.
DOI: https://doi.org/10.48550/arXiv.1707.06347.

[33] M. Frigge, D. C. Hoaglin, B. Iglewicz, Some Implementations of the Boxplot, The American Statistician 43 (1) (1989), p. 50–54. DOI: 10.1080/00031305.1989.10475612.




DOI: http://dx.doi.org/10.26583/bit.2024.2.06



Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.