General Policy Evaluation and Improvement by Learning to Identify Few But Crucial States

by Francesco Faccio, et al.

Learning to evaluate and improve policies is a core problem of Reinforcement Learning (RL). Traditional RL algorithms learn a value function defined for a single policy. A recently explored competitive alternative is to learn a single value function for many policies. Here we combine the actor-critic architecture of Parameter-Based Value Functions and the policy embedding of Policy Evaluation Networks to learn a single value function for evaluating (and thus helping to improve) any policy represented by a deep neural network (NN). The method yields competitive experimental results. In continuous control problems with infinitely many states, our value function minimizes its prediction error by simultaneously learning a small set of 'probing states' and a mapping from the actions produced in those probing states to the policy's return. The method extracts crucial abstract knowledge about the environment in the form of very few states that are sufficient to fully specify the behavior of many policies. A policy improves solely by changing its actions in the probing states, following the gradient of the value function's predictions. Surprisingly, it is possible to clone the behavior of a near-optimal policy in the Swimmer-v3 and Hopper-v3 environments knowing only how to act in 3 and 5 such learned states, respectively. Remarkably, our value function, trained to evaluate NN policies, is also invariant to changes of the policy architecture: we show that it allows for zero-shot learning of linear policies competitive with the best policy seen during training. Our code is public.
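
The probing-state idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the linear-tanh policy, the linear value head, the finite-difference gradient, and all names (`policy_actions`, `fingerprint`, `predicted_return`, `improve`) are assumptions chosen to keep the example dependency-free. In the paper, both the policy and the value function are neural networks, and the probing states are themselves learned; here they are fixed by hand.

```python
import math

def policy_actions(weights, state):
    """Hypothetical deterministic policy: tanh of a linear map.
    `weights` holds one row of coefficients per action dimension."""
    return [math.tanh(sum(w * s for w, s in zip(row, state))) for row in weights]

def fingerprint(weights, probing_states):
    """Concatenate the actions the policy produces in each probing state."""
    actions = []
    for state in probing_states:
        actions.extend(policy_actions(weights, state))
    return actions

def predicted_return(weights, probing_states, head):
    """Value function: maps the fingerprint to a scalar return estimate.
    A linear head is an assumption made for brevity."""
    return sum(h * a for h, a in zip(head, fingerprint(weights, probing_states)))

def improve(weights, probing_states, head, lr=0.1, eps=1e-5):
    """One gradient-ascent step on the predicted return with respect to the
    policy weights, using finite differences instead of backpropagation."""
    base = predicted_return(weights, probing_states, head)
    updated = [row[:] for row in weights]
    for i, row in enumerate(weights):
        for j in range(len(row)):
            bumped = [r[:] for r in weights]
            bumped[i][j] += eps
            grad = (predicted_return(bumped, probing_states, head) - base) / eps
            updated[i][j] = weights[i][j] + lr * grad
    return updated

# Toy setup (all values are illustrative): 2-D states, 1-D actions,
# three hand-picked probing states, and a fixed value head.
probing_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head = [1.0, -0.5, 0.25]
policy_weights = [[0.0, 0.0]]
improved_weights = improve(policy_weights, probing_states, head)
```

The point of the sketch is that the value function never sees the policy's parameters directly, only its actions in the probing states. This is consistent with the abstract's claim that the value function is invariant to the policy architecture: any policy that can be queried in those states can be evaluated and improved the same way.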




Parameter-based Value Functions

Learning value functions off-policy is at the core of modern Reinforceme...

Fast Adaptation via Policy-Dynamics Value Functions

Standard RL algorithms assume fixed environment dynamics and require a s...

Policy Evaluation Networks

Many reinforcement learning algorithms use value functions to guide the ...

Goal-Conditioned Generators of Deep Policies

Goal-conditioned Reinforcement Learning (RL) aims at learning optimal po...

VA-learning as a more efficient alternative to Q-learning

In reinforcement learning, the advantage function is critical for policy...

Generation of Policy-Level Explanations for Reinforcement Learning

Though reinforcement learning has greatly benefited from the incorporati...

Kalman meets Bellman: Improving Policy Evaluation through Value Tracking

Policy evaluation is a key process in Reinforcement Learning (RL). It as...
