Nearly Minimax Optimal Reward-Free Reinforcement Learning

APA

Du, S. (2020). Nearly Minimax Optimal Reward-Free Reinforcement Learning. The Simons Institute for the Theory of Computing. https://simons.berkeley.edu/talks/tbd-250

MLA

Du, Simon. "Nearly Minimax Optimal Reward-Free Reinforcement Learning." The Simons Institute for the Theory of Computing, 3 Dec. 2020, https://simons.berkeley.edu/talks/tbd-250.

BibTeX

@misc{scivideos_16831,
  author    = {Simon Du},
  title     = {Nearly Minimax Optimal Reward-Free Reinforcement Learning},
  publisher = {The Simons Institute for the Theory of Computing},
  year      = {2020},
  month     = {dec},
  language  = {en},
  url       = {https://simons.berkeley.edu/talks/tbd-250},
  note      = {Talk 16831, see \url{https://scivideos.org/index.php/Simons-Institute/16831}}
}
          
Simon Du (University of Washington)
Talk number: 16831
Source Repository: Simons Institute

Abstract

We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and for scenarios where one needs policies for multiple reward functions. The framework has two phases. In the exploration phase, the agent collects trajectories by interacting with the environment without using any reward signal. In the planning phase, the agent must return a near-optimal policy for an arbitrary reward function. We give a new efficient algorithm that interacts with the environment for at most $O\left(\frac{S^2A}{\epsilon^2}\,\mathrm{poly}\log\left(\frac{SAH}{\epsilon}\right)\right)$ episodes in the exploration phase and is guaranteed to output a near-optimal policy for an arbitrary reward function in the planning phase. Here, $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the planning horizon, and $\epsilon$ is the target accuracy relative to the total reward. Our sample complexity scales only logarithmically with $H$, in contrast to all existing results, which scale polynomially with $H$. Furthermore, this bound matches the minimax lower bound $\Omega\left(\frac{S^2A}{\epsilon^2}\right)$ up to logarithmic factors.
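To make the two-phase framework concrete, below is a minimal Python sketch, not the paper's algorithm. It assumes a tabular, finite-horizon MDP with a hypothetical episodic interface (env.reset() and env.step(s, a) returning only the next state, with no reward signal); the names TabularMDP, explore, and plan are invented for illustration. The exploration phase here uses a placeholder uniform-random policy to fit an empirical transition model; the planning phase then runs finite-horizon value iteration on that model for whatever reward function is supplied afterward.

    import numpy as np

    class TabularMDP:
        # Tiny random tabular MDP, included only so the sketch runs end to end.
        def __init__(self, S, A, rng):
            self.P = rng.dirichlet(np.ones(S), size=(S, A))  # true transitions P[s, a] over next states
            self.rng = rng
        def reset(self):
            return 0  # fixed initial state
        def step(self, s, a):
            return self.rng.choice(self.P.shape[2], p=self.P[s, a])

    def explore(env, num_episodes, H, S, A, rng):
        # Exploration phase: interact without any reward signal and build an
        # empirical transition model from visit counts. The uniform-random
        # action choice below is a naive placeholder; the paper's algorithm
        # selects exploration policies far more carefully to reach the
        # minimax episode count.
        counts = np.zeros((S, A, S))
        for _ in range(num_episodes):
            s = env.reset()
            for _ in range(H):
                a = int(rng.integers(A))
                s_next = env.step(s, a)
                counts[s, a, s_next] += 1
                s = s_next
        n_sa = counts.sum(axis=2, keepdims=True)
        # Empirical P(s'|s, a); fall back to uniform for unvisited (s, a) pairs.
        P_hat = np.divide(counts, n_sa,
                          out=np.full_like(counts, 1.0 / S),
                          where=n_sa > 0)
        return P_hat

    def plan(P_hat, reward, H):
        # Planning phase: finite-horizon value iteration on the learned model
        # for an arbitrary reward function reward[h, s, a] revealed only now.
        S, A, _ = P_hat.shape
        V = np.zeros(S)
        pi = np.zeros((H, S), dtype=int)
        for h in reversed(range(H)):
            Q = reward[h] + P_hat @ V  # Q[s, a] = r_h(s, a) + E_{s' ~ P_hat}[V(s')]
            pi[h] = Q.argmax(axis=1)
            V = Q.max(axis=1)
        return pi

    rng = np.random.default_rng(0)
    S, A, H = 5, 3, 10
    env = TabularMDP(S, A, rng)
    P_hat = explore(env, num_episodes=2000, H=H, S=S, A=A, rng=rng)
    reward = rng.random((H, S, A))  # any reward function can be supplied here
    pi = plan(P_hat, reward, H)     # policy for that reward, computed from the model

With the uniform-random placeholder policy, this sketch would need far more episodes than the bound in the abstract; the talk's contribution is precisely an exploration strategy whose episode count of $O\left(\frac{S^2A}{\epsilon^2}\,\mathrm{poly}\log\left(\frac{SAH}{\epsilon}\right)\right)$ matches the minimax lower bound up to logarithmic factors.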