Nearly Horizon-Free Offline Reinforcement Learning
We revisit offline reinforcement learning on episodic time-homogeneous tabular Markov Decision Processes (MDPs) with S states, A actions, and planning horizon H. Given a dataset of N collected episodes with minimum cumulative reaching probability d_m, we obtain the first set of nearly H-free sample complexity bounds for evaluation and planning using the empirical MDP:

1. For offline evaluation, we obtain an Õ(√(1/(N d_m))) error rate, which matches the lower bound and has no additional dependence on (S, A) in the higher-order term, in contrast to previous works <cit.>.

2. For offline policy optimization, we obtain an Õ(√(1/(N d_m)) + S/(N d_m)) error rate, improving upon the best known result of <cit.>, which has additional H and S factors in its main term. This bound approaches the Ω(√(1/(N d_m))) lower bound up to logarithmic factors and a higher-order term.

To the best of our knowledge, these are the first nearly horizon-free bounds in offline reinforcement learning.
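As a rough illustration of the plug-in (empirical MDP) approach the abstract refers to, the sketch below estimates a tabular MDP from offline episodes and runs finite-horizon backward induction for offline evaluation and planning. The function names, the self-loop convention for unvisited state-action pairs, and the data format are illustrative assumptions, not the paper's exact construction or analysis.

```python
import numpy as np

def build_empirical_mdp(episodes, S, A):
    """Estimate transitions and mean rewards from offline episodes.

    `episodes` is a list of trajectories, each a list of (s, a, r, s_next)
    tuples (an assumed format). Unvisited (s, a) pairs default to a zero-reward
    self-loop, one common convention; the paper's construction may differ.
    """
    counts = np.zeros((S, A, S))
    reward_sums = np.zeros((S, A))
    for traj in episodes:
        for s, a, r, s_next in traj:
            counts[s, a, s_next] += 1
            reward_sums[s, a] += r
    visits = counts.sum(axis=2)                       # N(s, a)
    P_hat = np.zeros((S, A, S))
    r_hat = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            if visits[s, a] > 0:
                P_hat[s, a] = counts[s, a] / visits[s, a]
                r_hat[s, a] = reward_sums[s, a] / visits[s, a]
            else:
                P_hat[s, a, s] = 1.0                  # self-loop for unseen pairs
    return P_hat, r_hat

def evaluate_policy(P_hat, r_hat, pi, H):
    """Offline evaluation: H-step value of a deterministic policy pi on the empirical MDP."""
    S = r_hat.shape[0]
    V = np.zeros(S)
    for _ in range(H):                                # backward induction (time-homogeneous)
        V = np.array([r_hat[s, pi[s]] + P_hat[s, pi[s]] @ V for s in range(S)])
    return V

def plan(P_hat, r_hat, H):
    """Offline planning: finite-horizon value iteration on the empirical MDP.

    Returns the non-stationary greedy policies (policies[h] is used at step h)
    and the optimal H-step value from each state.
    """
    S, A = r_hat.shape
    V = np.zeros(S)
    policies = []
    for _ in range(H):
        Q = r_hat + np.einsum('sat,t->sa', P_hat, V)  # Q(s, a) = r(s, a) + P(.|s, a) . V
        policies.append(Q.argmax(axis=1))
        V = Q.max(axis=1)
    policies.reverse()
    return policies, V
```

The error rates quoted above concern how far such plug-in estimates can deviate from their true-MDP counterparts as a function of N, d_m, and S, with at most logarithmic dependence on H.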