
Sample Complexity of Policy-Based Methods under Off-Policy Sampling and Linear Function Approximation

APA

Maguluri, S. T. (2022, October 7). Sample Complexity of Policy-Based Methods under Off-Policy Sampling and Linear Function Approximation [Talk]. The Simons Institute for the Theory of Computing. https://old.simons.berkeley.edu/talks/sample-complexity-policy-based-methods-under-policy-sampling-and-linear-function-approximation

MLA

Maguluri, Siva Theja. "Sample Complexity of Policy-Based Methods under Off-Policy Sampling and Linear Function Approximation." The Simons Institute for the Theory of Computing, 7 Oct. 2022, https://old.simons.berkeley.edu/talks/sample-complexity-policy-based-methods-under-policy-sampling-and-linear-function-approximation

BibTeX

          @misc{scivideos_22709,
            author    = {Maguluri, Siva Theja},
            title     = {Sample Complexity of Policy-Based Methods under Off-Policy Sampling and Linear Function Approximation},
            publisher = {The Simons Institute for the Theory of Computing},
            year      = {2022},
            month     = {oct},
            url       = {https://old.simons.berkeley.edu/talks/sample-complexity-policy-based-methods-under-policy-sampling-and-linear-function-approximation},
            language  = {en},
            note      = {Talk 22709; see \url{https://scivideos.org/simons-institute/22709}}
          }
          
Siva Theja Maguluri (Georgia Institute of Technology)
Talk number: 22709
Source repository: Simons Institute

Abstract

In this work, we study policy-space methods for solving the reinforcement learning (RL) problem. We first consider the setting where the model is known (the MDP setting) and show that the Natural Policy Gradient (NPG) algorithm (and other variants) converges linearly (geometrically) without the need for any regularization. In contrast to the optimization approaches used in the literature, our approach is based on approximate dynamic programming. We then consider a variant with linear function approximation and establish its convergence guarantees. Finally, we consider the RL setting where the model is unknown and a critic is deployed for policy evaluation. We consider a critic based on TD learning that uses off-policy sampling with linear function approximation, which leads to the infamous deadly-triad issue. We propose a generic algorithmic framework of single-time-scale, multi-step TD learning with generalized importance sampling ratios that enables us to overcome the high-variance issue in off-policy learning, and we establish its finite-sample guarantees. We show that this leads to an overall sample complexity of O(ε^-2).
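
As a concrete illustration of the first result, under a softmax parameterization one tabular NPG step takes the well-known multiplicative form pi_{k+1}(a|s) proportional to pi_k(a|s) * exp(eta * Q^{pi_k}(s, a)). The short Python sketch below is illustrative only and is not the implementation analyzed in the talk; the step size eta and the arrays pi and Q are hypothetical inputs.

import numpy as np

def npg_softmax_step(pi, Q, eta):
    """One tabular NPG step under softmax parameterization.

    pi and Q are (num_states, num_actions) arrays; the update multiplies the
    current policy by exp(eta * Q) elementwise and renormalizes each row.
    """
    unnormalized = pi * np.exp(eta * Q)
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)

For the critic, the abstract describes single-time-scale, multi-step TD learning with generalized importance sampling ratios; those generalized ratios are not reproduced here. The hedged sketch below shows a generic n-step off-policy TD update with linear value features and ordinary likelihood ratios pi(a|s)/mu(a|s), only to indicate where importance sampling enters; phi, pi, mu, gamma, alpha, and n are assumed inputs for the example.

def off_policy_nstep_td(w, states, actions, rewards, phi, pi, mu,
                        gamma, alpha, n):
    """One n-step off-policy TD update with linear value features
    V(s) ~ phi[s] @ w and per-step importance-sampling ratios.

    states has length n + 1; actions and rewards have length n.
    """
    correction, weight = 0.0, 1.0
    for i in range(n):
        # One-step TD error under the current weight vector w.
        delta = rewards[i] + gamma * phi[states[i + 1]] @ w - phi[states[i]] @ w
        # Running product of likelihood ratios pi(a|s) / mu(a|s).
        weight *= pi[states[i], actions[i]] / mu[states[i], actions[i]]
        correction += (gamma ** i) * weight * delta
    # Semi-gradient update on the starting state's features.
    return w + alpha * correction * phi[states[0]]

In the single-time-scale setting described above, updates of this kind would be interleaved with the policy updates on the same stream of off-policy samples, rather than running the critic to convergence in an inner loop.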