We study the effect of baselines in on-policy stochastic policy gradient...
Temporal difference methods enable efficient estimation of value functio...
Importance sampling (IS) is a common reweighting strategy for off-policy...
Estimating the value function for a fixed policy is a fundamental proble...