k-Certainty Exploration Method: An Action Selection Strategy for Environment Identification in Reinforcement Learning

Abstract

Reinforcement learning aims to adapt a system to an unknown environment according to rewards. It must handle two issues: delayed reward and uncertainty. Q-learning is a representative reinforcement learning method and is widely used because it can learn the optimum policy. However, Q-learning needs numerous trials to converge to the optimum policy. If the target environment can be described as a Markov decision process, we can identify it from statistics of sensor-action pairs. Once we have built a correct environment model, we can derive the optimum policy with the Policy Iteration Algorithm. Therefore, we can construct the optimum policy by identifying the environment efficiently. In this paper, we separate the learning process into two phases: identifying the environment and determining the optimum policy. We propose the k-Certainty Exploration Method for identifying the environment; the optimum policy is then determined by the Policy Iteration Algorithm. We say that a rule is k-Certainty if and only if the number of times it has been selected is larger than k. The k-Certainty Exploration Method suppresses any loop of rules that have already achieved k-Certainty. We show its effectiveness by comparing it with Q-learning in two experiments: one in the maze environment of Dyna, the other in an environment where the optimum policy varies according to a parameter.
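
The k-Certainty bookkeeping described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a "rule" is a (state, action) pair, that a rule counts as k-Certainty once its selection count is larger than k (following the abstract's wording), and that the environment model is estimated from transition counts. The names `KCertaintyExplorer`, `select_action`, `record`, `transition_probabilities`, and `explore_until_certain` are hypothetical. The random fallback used when every rule in the current state is already k-Certainty is only a placeholder for the paper's loop-suppression mechanism, which the abstract does not specify, and the Policy Iteration phase is omitted.

```python
import random
from collections import defaultdict


class KCertaintyExplorer:
    """Illustrative bookkeeping for k-Certainty rules (hypothetical names)."""

    def __init__(self, actions, k):
        self.actions = list(actions)
        self.k = k
        # (state, action) -> number of times the rule has been selected
        self.counts = defaultdict(int)
        # (state, action) -> {next_state: number of observed transitions}
        self.transitions = defaultdict(lambda: defaultdict(int))

    def is_k_certain(self, state, action):
        # Assumption taken from the abstract: selected more than k times.
        return self.counts[(state, action)] > self.k

    def select_action(self, state):
        # Prefer a rule in this state that has not reached k-Certainty yet,
        # choosing the least-tried one (ties go to the first listed action).
        uncertain = [a for a in self.actions if not self.is_k_certain(state, a)]
        if uncertain:
            return min(uncertain, key=lambda a: self.counts[(state, a)])
        # Placeholder only: the paper's method suppresses loops of
        # k-Certainty rules; a random choice stands in for that step here.
        return random.choice(self.actions)

    def record(self, state, action, next_state):
        # Update the statistics used to identify the environment.
        self.counts[(state, action)] += 1
        self.transitions[(state, action)][next_state] += 1

    def transition_probabilities(self, state, action):
        # Maximum-likelihood estimate of P(next_state | state, action).
        total = self.counts[(state, action)]
        if total == 0:
            return {}
        observed = self.transitions[(state, action)]
        return {s2: c / total for s2, c in observed.items()}


def explore_until_certain(explorer, step, start_state, states, max_steps=1000):
    """Run exploration until every rule is k-Certainty or max_steps is hit.

    `step(state, action)` is a hypothetical environment function that
    returns the next state. Returns the number of steps taken.
    """
    state = start_state
    for n in range(max_steps):
        if all(explorer.is_k_certain(s, a)
               for s in states for a in explorer.actions):
            return n
        action = explorer.select_action(state)
        next_state = step(state, action)
        explorer.record(state, action, next_state)
        state = next_state
    return max_steps
```

Once every rule has reached k-Certainty, the estimates returned by `transition_probabilities` could be handed to a standard policy iteration routine, which corresponds to the second phase described in the abstract.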
Paper published by the Japanese Society for Artificial Intelligence
1995-05-01