Supervised Policy Update
We propose a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning. Starting with data generated by the current policy, SPU optimizes over the proximal policy space to find a non-parameterized policy. It then solves a supervised regression problem to convert the non-parameterized policy to a parameterized policy, from which it draws new samples. There is significant flexibility in setting the labels in the supervised regression problem, with different settings corresponding to different underlying optimization problems. We develop a methodology for finding an optimal policy in the non-parameterized policy space, and show how Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) can be addressed by this methodology. In terms of sample efficiency, our experiments show SPU can outperform PPO for simulated robotic locomotion tasks.
READ FULL TEXT