A Logarithmic-Time Updating Algorithm for TD(λ) Learning

Abstract

The temporal-difference (TD) method is an incremental learning method for long-term prediction problems, and most reinforcement learning methods are based on it. To cope with partial observability, it has to be combined with eligibility traces, which raises a time-complexity problem. Some conventional techniques reduce this cost, but they are unavailable in environments where there may be a long delay between observations and their consequent rewards. In this paper we propose an algorithm that accurately computes the TD(λ) update in logarithmic time. It can safely be used in all kinds of environments, because it is proved to give the accurate TD prediction. We also apply our algorithm to Sarsa(λ), a reinforcement learning method that uses eligibility traces; it can likewise be applied to Q(λ)-learning. Accumulating Sarsa(λ) usually takes time linear in the number of actions for action selection. There exist two definitions of replacing Sarsa(λ); the more common and better one can be computed in time logarithmic in the number of observations and the number of actions, by means of a special device.
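For orientation, the time-complexity issue mentioned above can be seen in a minimal sketch of the conventional tabular TD(λ) update with accumulating traces. This is not the logarithmic-time algorithm proposed in the paper, which the abstract does not describe; the function name `td_lambda_step` and the parameter values `alpha`, `gamma`, and `lam` are assumptions chosen purely for illustration.

```python
def td_lambda_step(values, traces, obs, reward, next_obs,
                   alpha=0.1, gamma=0.9, lam=0.8):
    """One conventional TD(lambda) update with accumulating traces.

    Illustrative baseline only (hypothetical names and defaults);
    it is NOT the paper's logarithmic-time algorithm.
    """
    # TD error for the observed transition.
    td_error = reward + gamma * values[next_obs] - values[obs]
    # Accumulating trace: increment the trace of the current observation.
    traces[obs] += 1.0
    # Every entry is touched on every step, so the cost per step
    # is linear in the number of observations.
    for i in range(len(values)):
        values[i] += alpha * td_error * traces[i]
        traces[i] *= gamma * lam
    return td_error


values = [0.0, 0.0, 0.0]
traces = [0.0, 0.0, 0.0]
first_error = td_lambda_step(values, traces, obs=0, reward=1.0, next_obs=1)
second_error = td_lambda_step(values, traces, obs=1, reward=1.0, next_obs=2)
```

The loop over all entries is what makes the straightforward update linear per step; the paper's contribution, according to the abstract, is to obtain the same TD(λ) predictions in logarithmic time.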
Paper of the Japanese Society for Artificial Intelligence (社団法人人工知能学会)
1999-09-01