An Exploration Method for Deep Reinforcement Learning Combined with Stochastic Policies - AET (Application of Electronic Technique)

Information Technology and Network Security

Yang Shangtong, Wang Zilei

(School of Cyberspace Security, University of Science and Technology of China, Hefei, Anhui 230027, China)

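Abstract: Deep reinforcement learning algorithms can now solve many complex tasks, yet balancing exploration and exploitation remains a fundamental problem in reinforcement learning. To address it, this paper proposes an exploration method for deep reinforcement learning that is combined with a stochastic policy. The method exploits the exploration ability inherent in stochastic policies: experience samples generated by a stochastic policy are used to train a deterministic policy, which encourages the deterministic policy to learn to explore while retaining its own strengths. Combining the deterministic policy algorithm DDPG with the proposed exploration method yields the stochastic guidance for deterministic policy gradient algorithm (SGDPG). Experiments in several complex environments show that, on exploration problems, SGDPG achieves higher exploration efficiency and sample efficiency than DDPG.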

Keywords: reinforcement learning; deep reinforcement learning; exploration-exploitation dilemma

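CLC number: TP18
Document code: A
DOI: 10.19358/j.issn.2096-5133.2021.06.008
Citation format: Yang Shangtong, Wang Zilei. An exploration method for deep reinforcement learning combined with stochastic policies[J]. Information Technology and Network Security, 2021, 40(6): 43-49.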

Efficient exploration with stochastic policy for deep reinforcement learning

Yang Shangtong, Wang Zilei

(School of Cyberspace Security, University of Science and Technology of China, Hefei 230027, China)

Abstract: Deep reinforcement learning algorithms have been shown to solve many complex tasks, but balancing exploration and exploitation remains a fundamental problem. This paper therefore proposes an efficient exploration strategy for deep reinforcement learning that is combined with a stochastic policy. The main contribution is to train a deterministic policy on experience generated by a stochastic policy, which takes advantage of the exploration ability of stochastic policies and encourages the deterministic policy to learn to explore while maintaining its own advantages. Combining DDPG (Deep Deterministic Policy Gradient) with the proposed exploration method yields an algorithm called stochastic guidance for deterministic policy gradient (SGDPG). Experiments in several complex environments show that SGDPG has higher exploration efficiency and sample efficiency than DDPG when faced with deep exploration problems.

Key words: reinforcement learning; deep reinforcement learning; exploration-exploitation dilemma

0 Introduction

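Reinforcement learning, currently a research hotspot in machine learning, has made great progress on sequential decision-making problems and is widely applied in game playing [1], robot control [2], industrial applications [3], and other fields. In recent years, many reinforcement learning methods have used neural networks to improve their performance, giving rise to a new research field known as deep reinforcement learning (DRL) [4]. Reinforcement learning nevertheless still faces a major problem: the exploration-exploitation dilemma. During learning, exploration means that the agent tries actions it has not taken before, which may yield higher returns, whereas exploitation means that the agent chooses the currently best action based on its past experience. Current research on deep reinforcement learning focuses mainly on combining deep learning to improve the generalization ability of reinforcement learning algorithms, and how to explore the state space effectively remains a key challenge.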
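The trade-off described above can be made concrete with a small sketch. The example below is not the SGDPG algorithm from this paper; it is a generic epsilon-greedy agent on a Gaussian multi-armed bandit, chosen only because it shows exploration and exploitation in a few lines. The function name `run_bandit`, the arm means, the unit reward noise, and the epsilon values are all assumptions made for illustration.

```python
import random


def run_bandit(epsilon, arm_means, steps=5000, seed=0):
    """Play a Gaussian multi-armed bandit with an epsilon-greedy agent.

    Hypothetical helper for illustration only (not from the paper).
    Returns (average reward per step, value estimates, pull counts).
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    estimates = [0.0] * n_arms  # running mean reward observed for each arm
    counts = [0] * n_arms       # number of times each arm has been pulled
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            # Exploration: try an arm chosen uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploitation: pick the arm with the best estimate so far.
            arm = max(range(n_arms), key=lambda i: estimates[i])

        # Assumed reward model: true mean of the arm plus unit Gaussian noise.
        reward = rng.gauss(arm_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the sample mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return total_reward / steps, estimates, counts


if __name__ == "__main__":
    means = [0.0, 0.5, 2.0]  # assumed true mean reward of each arm
    for eps in (0.0, 0.1, 1.0):
        avg, _, counts = run_bandit(eps, means)
        print(f"epsilon={eps}: average reward {avg:.3f}, pulls per arm {counts}")
```

With `epsilon=1.0` the agent only explores and with `epsilon=0.0` it only exploits its current estimates; intermediate values mix the two. The method proposed in this paper addresses the same tension in a different way, by training a deterministic policy on experience generated by a stochastic policy.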

The full text of this article can be downloaded at: http://www.chinaaet.com/resource/share/2000003599


Original content statement: This content is original to the AET website; reproduction without authorization is prohibited.
