Reinforcement Learning under Drift
We propose algorithms with state-of-the-art dynamic regret bounds for undiscounted reinforcement learning under drifting non-stationarity, where both the reward functions and the state transition distributions are allowed to evolve over time. Our main contributions are: 1) a tuned Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence-Widening (SWUCRL2-CW) algorithm, which attains low dynamic regret bounds against the optimal non-stationary policy in various settings; 2) the Bandit-over-Reinforcement Learning (BORL) framework, which further allows us to enjoy these dynamic regret bounds in a parameter-free manner.
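To illustrate the sliding-window principle that underlies the approach, the following is a minimal sketch in a simplified non-stationary multi-armed bandit setting, not the paper's SWUCRL2-CW algorithm: only the most recent plays are used to build confidence bounds, so estimates can track drifting distributions. The class name, window size, and confidence constant are assumptions chosen for illustration.

```python
import math
import random
from collections import deque


class SlidingWindowUCB:
    """Illustrative sliding-window UCB for a drifting multi-armed bandit.

    Only the most recent `window` plays are kept, so per-arm estimates
    forget stale data and adapt to drifting reward distributions.
    (A simplified sketch; not the SWUCRL2-CW algorithm for MDPs.)
    """

    def __init__(self, n_arms, window, confidence=2.0):
        self.n_arms = n_arms
        self.window = window          # sliding-window length (assumed parameter)
        self.confidence = confidence  # exploration constant (assumed parameter)
        self.history = deque()        # recent (arm, reward) pairs, length <= window
        self.t = 0                    # total number of plays so far

    def select_arm(self):
        self.t += 1
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        best_arm, best_value = 0, -float("inf")
        for arm in range(self.n_arms):
            if counts[arm] == 0:
                return arm  # play any arm unseen within the current window
            mean = sums[arm] / counts[arm]
            # Confidence radius computed from window-local counts only.
            radius = math.sqrt(
                self.confidence * math.log(min(self.t, self.window)) / counts[arm]
            )
            if mean + radius > best_value:
                best_arm, best_value = arm, mean + radius
        return best_arm

    def update(self, arm, reward):
        self.history.append((arm, reward))
        if len(self.history) > self.window:
            self.history.popleft()  # drop the oldest observation


# Usage example: two arms whose mean rewards drift over time.
if __name__ == "__main__":
    random.seed(0)
    agent = SlidingWindowUCB(n_arms=2, window=200)
    for t in range(2000):
        means = [0.8 - 0.0004 * t, 0.2 + 0.0004 * t]  # arm 0 degrades, arm 1 improves
        arm = agent.select_arm()
        reward = 1.0 if random.random() < means[arm] else 0.0
        agent.update(arm, reward)
```

In the full reinforcement-learning setting, the paper additionally widens the confidence regions for the estimated transition distributions (the "CW" in SWUCRL2-CW), and the BORL framework runs a bandit algorithm over candidate window lengths so that no knowledge of the drift is required.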