Non-asymptotic Performances of Robust Markov Decision Processes

by Wenhao Yang et al.

In this paper, we study the non-asymptotic performance of the optimal robust policy, evaluated by its robust value function under the true transition dynamics. The optimal robust policy is computed from a generative model or an offline dataset, without access to the true transition dynamics. In particular, we consider three uncertainty sets, namely the L_1, χ^2, and KL balls, under both the (s,a)-rectangular and the s-rectangular assumptions. Our results show that, under the (s,a)-rectangular assumption, the sample complexity is about O(|𝒮|^2|𝒜| / (ε^2 ρ^2 (1-γ)^4)) in the generative model setting and O(|𝒮| / (ν_min ε^2 ρ^2 (1-γ)^4)) in the offline dataset setting. While prior non-asymptotic analyses are restricted to the KL ball under the (s,a)-rectangular assumption, we also extend our results to the more general s-rectangular assumption, which leads to a larger sample complexity than the (s,a)-rectangular assumption.
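The abstract does not spell out the solver. As a rough illustration of what "computing the optimal robust policy from a generative model" involves in the simplest case above (an (s,a)-rectangular L_1 ball of radius ρ around the empirical transitions), here is a minimal sketch of standard robust value iteration. It is an assumption-laden sketch, not necessarily the authors' procedure: the function names `estimate_transitions`, `worst_case_l1`, and `robust_value_iteration` are hypothetical, the generative model is simulated by sampling from a known `p_true`, and the inner minimization uses the usual greedy solution for a linear objective over an L_1 ball intersected with the simplex (shift up to ρ/2 mass from the highest-value next states onto the lowest-value one).

```python
import numpy as np


def estimate_transitions(p_true, n, rng):
    """Hypothetical generative-model step: draw n next states per (s, a)
    from p_true (shape (S, A, S)) and return the empirical P_hat."""
    S, A, _ = p_true.shape
    p_hat = np.zeros_like(p_true, dtype=float)
    for s in range(S):
        for a in range(A):
            counts = rng.multinomial(n, p_true[s, a])
            p_hat[s, a] = counts / n
    return p_hat


def worst_case_l1(p_hat, v, rho):
    """Minimize p @ v over {p in simplex : ||p - p_hat||_1 <= rho}.

    Greedy solution: move up to rho/2 probability mass onto the
    lowest-value state, taking it from the highest-value states first.
    Each unit moved costs 2 units of L1 distance.
    """
    p = np.asarray(p_hat, dtype=float).copy()
    lo = int(np.argmin(v))
    budget = min(rho / 2.0, 1.0 - p[lo])
    p[lo] += budget
    # Remove the same amount of mass, largest values first.
    for s in np.argsort(v)[::-1]:
        if budget <= 0:
            break
        if s == lo:
            continue
        take = min(p[s], budget)
        p[s] -= take
        budget -= take
    return float(p @ v)


def robust_value_iteration(p_hat, r, gamma, rho, tol=1e-8, max_iter=10_000):
    """Iterate V(s) = max_a [ r(s,a) + gamma * min_{p in U(s,a)} p @ V ],
    where U(s,a) is the L1 ball of radius rho around p_hat[s, a].
    Returns the robust values and a greedy deterministic policy."""
    S, A, _ = p_hat.shape
    v = np.zeros(S)
    q = np.zeros((S, A))
    for _ in range(max_iter):
        for s in range(S):
            for a in range(A):
                q[s, a] = r[s, a] + gamma * worst_case_l1(p_hat[s, a], v, rho)
        v_new = q.max(axis=1)
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    return v, q.argmax(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 4, 2
    p_true = rng.dirichlet(np.ones(S), size=(S, A))  # shape (S, A, S)
    r = rng.uniform(size=(S, A))
    p_hat = estimate_transitions(p_true, n=200, rng=rng)
    v_rob, pi = robust_value_iteration(p_hat, r, gamma=0.9, rho=0.2)
    print("robust values:", np.round(v_rob, 3), "policy:", pi)
```

Because the robust Bellman operator is a γ-contraction, the iteration converges for any ρ ≥ 0; setting ρ = 0 recovers ordinary value iteration on P̂. The χ^2, KL, and s-rectangular cases would need a different inner minimization, which this sketch does not attempt.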


