Efficient Sample Reuse in Policy Gradients with Parameter-based Exploration (Information-Based Induction Sciences and Machine Learning)

Abstract

The policy gradient approach is a flexible and powerful reinforcement learning method, particularly for problems with continuous actions such as robot control. A common challenge in this setting is stabilizing policy gradient estimates so that policy updates are reliable. In this paper, we combine three ideas into a highly stable and practical policy gradient method: (a) policy gradients with parameter-based exploration, a recently proposed policy search method with high stability; (b) an importance sampling technique, which allows previously gathered data to be reused in an unbiased way; and (c) an optimal baseline, which minimizes the variance of gradient estimates while keeping them unbiased. We give a theoretical analysis of the variance of the proposed method's gradient estimates and demonstrate its usefulness through experiments.
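To make the combination of the three ideas concrete, the following is a minimal sketch and not the authors' implementation. It assumes a one-dimensional policy parameter drawn from a Gaussian with hyper-parameters (mean, std), and a baseline of the commonly used variance-minimizing form (returns averaged with weights w^2 * ||grad log p||^2). The function names and the toy return function are hypothetical. Because the baseline here is estimated from the same samples as the gradient, the estimate is only approximately unbiased.

```python
import math
import random


def gaussian_pdf(theta, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) at theta."""
    z = (theta - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def log_density_grad(theta, mean, std):
    """Gradient of log N(theta | mean, std^2) with respect to (mean, std)."""
    d = theta - mean
    return (d / std**2, (d * d - std * std) / std**3)


def iw_pgpe_gradient(thetas, returns, target, behavior):
    """Importance-weighted, baseline-subtracted gradient estimate (illustrative).

    thetas   : policy parameters sampled under the behavior hyper-parameters
    returns  : return of the rollout obtained with each sampled parameter
    target   : (mean, std) at which the gradient is wanted
    behavior : (mean, std) under which the samples were actually drawn
    """
    # (b) importance weights let old samples be reused for the target distribution
    weights = [gaussian_pdf(t, *target) / gaussian_pdf(t, *behavior) for t in thetas]
    # (a) parameter-based exploration: differentiate the parameter distribution
    grads = [log_density_grad(t, *target) for t in thetas]
    # (c) assumed variance-minimizing baseline, estimated from the same samples
    sq = [w * w * (g[0] ** 2 + g[1] ** 2) for w, g in zip(weights, grads)]
    denom = sum(sq)
    baseline = sum(r * s for r, s in zip(returns, sq)) / denom if denom > 0 else 0.0
    n = len(thetas)
    g_mean = sum(w * (r - baseline) * g[0] for w, r, g in zip(weights, returns, grads)) / n
    g_std = sum(w * (r - baseline) * g[1] for w, r, g in zip(weights, returns, grads)) / n
    return (g_mean, g_std), baseline


if __name__ == "__main__":
    random.seed(0)
    behavior = (0.5, 1.5)  # hyper-parameters used when the data were gathered
    target = (0.0, 1.0)    # hyper-parameters currently being updated
    thetas = [random.gauss(*behavior) for _ in range(5000)]
    returns = [-(t - 2.0) ** 2 for t in thetas]  # hypothetical toy return
    print(iw_pgpe_gradient(thetas, returns, target, behavior))
```

In this toy setup the behavior distribution is wider than the target, which keeps the importance weights bounded; when the two distributions differ more strongly, the weights can become large and the estimate noisy.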
Paper from the Institute of Electronics, Information and Communication Engineers (IEICE)
2012-03-05