MDPs with low-rank transitions – that is, the transition matrix can be
f...
Off-policy evaluation often refers to two related tasks: estimating the
...
Addressing such diverse ends as safety, alignment with human preferences,...
Standard uniform convergence results bound the generalization gap of the...
Sample-efficiency guarantees for offline reinforcement learning (RL) oft...
To evaluate prospective contextual bandit policies when experimentation ...
In order to model risk aversion in reinforcement learning, an emerging l...
We cast visual imitation as a visual correspondence problem. Our robotic...