Policy Optimization via Optimal Policy Evaluation

Authors

Alberto Maria Metelli, Samuele Meta, Marcello Restelli

Abstract

Off-policy methods are the basis of a large number of effective Policy Optimization (PO) algorithms. In this setting, Importance Sampling (IS) is typically employed as a what-if analysis tool, with the goal of estimating the performance of a target policy given samples collected with a different behavioral policy. In Monte Carlo simulation, however, IS serves as a variance minimization technique: a suitable behavioral distribution is chosen for sampling, which reduces the variance of the estimator below the level achievable when sampling from the target distribution itself. In this paper, we analyze IS in these two guises and show the connections between the two objectives. We illustrate that variance minimization can be used as a performance improvement tool, with the advantage, compared with direct off-policy learning, of implicitly enforcing a trust region. We use these theoretical findings to build a PO algorithm, Policy Optimization via Optimal Policy Evaluation (PO2PE), that employs variance minimization as an inner loop. Finally, we present empirical evaluations on continuous RL benchmarks, with a particular focus on robustness to small batch sizes.
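
The variance minimization view of IS mentioned in the abstract can be illustrated with a toy discrete problem. The sketch below is not the PO2PE algorithm and is not taken from the paper; the distribution `p`, the values `f`, and the helper names `is_variance` and `is_estimate` are all hypothetical choices made for illustration. It relies only on the standard Monte Carlo fact that, for a non-negative function f, sampling from q*(x) proportional to p(x) f(x) makes the IS estimator of E_p[f] have zero variance, whereas sampling from the target p itself does not.

```python
import numpy as np

def is_variance(p, q, f):
    """Exact per-sample variance of the IS estimator f(x) * p(x) / q(x), x ~ q."""
    p, q, f = (np.asarray(a, dtype=float) for a in (p, q, f))
    mean = np.sum(p * f)                      # E_q[f p / q] = E_p[f] (unbiased)
    second_moment = np.sum((p * f) ** 2 / q)  # E_q[(f p / q)^2]
    return second_moment - mean ** 2

def is_estimate(rng, p, q, f, n):
    """Monte Carlo IS estimate of E_p[f] from n samples drawn from q."""
    x = rng.choice(len(p), size=n, p=q)       # sample outcomes from the behavioral q
    weights = p[x] / q[x]                     # importance weights
    return np.mean(weights * f[x])

# Hypothetical target distribution and non-negative values (illustration only).
p = np.array([0.7, 0.2, 0.1])
f = np.array([1.0, 5.0, 20.0])
true_mean = np.sum(p * f)

# Variance-minimizing behavioral distribution: q*(x) proportional to p(x) |f(x)|.
q_star = p * np.abs(f) / np.sum(p * np.abs(f))

var_on_policy = is_variance(p, p, f)       # sampling from the target itself
var_optimal = is_variance(p, q_star, f)    # sampling from q*

rng = np.random.default_rng(0)
estimate = is_estimate(rng, p, q_star, f, n=10)

print("on-policy variance:", var_on_policy)
print("variance under q*: ", var_optimal)
print("IS estimate under q*:", estimate, "true mean:", true_mean)
```

In this toy case q* places more mass on outcomes with large values of f, which hints at the connection the abstract draws between minimizing variance and improving performance. In a realistic RL setting q* is not available in closed form, because it depends on the very quantity being estimated.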
