Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function Approximation
We study reinforcement learning for finite-horizon episodic Markov decision processes with adversarial rewards and full-information feedback, where the unknown transition probability function is a linear function of a given feature mapping. We propose an optimistic policy optimization algorithm with a Bernstein bonus and show that it achieves an Õ(dH√T) regret, where H is the length of an episode, T is the number of interactions with the MDP, and d is the dimension of the feature mapping. Furthermore, we prove a lower bound of Ω̃(dH√T), which matches the upper bound up to logarithmic factors. To the best of our knowledge, this is the first computationally efficient, nearly minimax optimal algorithm for adversarial Markov decision processes with linear function approximation.
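To make the algorithmic ingredient concrete, the following is a minimal, hypothetical sketch of the core step in optimistic policy optimization: the policy is updated by an exponential-weights (mirror descent) step on an optimistic action-value estimate, i.e. an estimated Q-value plus an exploration bonus. The feature map phi, the elliptical bonus form, the step size eta, and all dimensions below are illustrative assumptions, not the paper's exact construction (the paper uses a sharper Bernstein-type bonus).

    import numpy as np

    d, num_actions = 4, 3                # feature dimension and action count (toy sizes)
    eta = 0.1                            # policy-update step size (assumed)
    beta = 1.0                           # bonus scale; Bernstein-style in the paper

    rng = np.random.default_rng(0)
    phi = rng.standard_normal((num_actions, d))   # hypothetical features phi(s, a) for one state s
    theta_hat = rng.standard_normal(d)            # estimated linear parameter
    Sigma = np.eye(d)                             # empirical covariance of observed features

    # Optimistic action-value estimate: linear estimate plus an elliptical bonus
    # beta * ||phi(s, a)||_{Sigma^{-1}} (a common UCB-type surrogate).
    q_hat = phi @ theta_hat
    bonus = beta * np.sqrt(np.einsum("ad,dk,ak->a", phi, np.linalg.inv(Sigma), phi))
    q_optimistic = q_hat + bonus

    # Exponential-weights (mirror descent) policy update for this state.
    policy = np.full(num_actions, 1.0 / num_actions)
    policy *= np.exp(eta * q_optimistic)
    policy /= policy.sum()
    print(policy)

In the full algorithm this update would be applied at every step of the horizon and every visited state, with the covariance matrix and parameter estimate refreshed from observed transitions after each episode.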