
Sample Complexity Of Policy-Based Methods Under Off-Policy Sampling And Linear Function Approximation

APA

Maguluri, S. T. (2022). Sample Complexity Of Policy-Based Methods Under Off-Policy Sampling And Linear Function Approximation. The Simons Institute for the Theory of Computing. https://old.simons.berkeley.edu/talks/sample-complexity-policy-based-methods-under-policy-sampling-and-linear-function-approximation

MLA

Maguluri, Siva Theja. Sample Complexity Of Policy-Based Methods Under Off-Policy Sampling And Linear Function Approximation. The Simons Institute for the Theory of Computing, 7 Oct. 2022, https://old.simons.berkeley.edu/talks/sample-complexity-policy-based-methods-under-policy-sampling-and-linear-function-approximation

BibTeX

@misc{scivideos_22709,
  doi = {},
  url = {https://old.simons.berkeley.edu/talks/sample-complexity-policy-based-methods-under-policy-sampling-and-linear-function-approximation},
  author = {Maguluri, Siva Theja},
  keywords = {},
  language = {en},
  title = {Sample Complexity Of Policy-Based Methods Under Off-Policy Sampling And Linear Function Approximation},
  publisher = {The Simons Institute for the Theory of Computing},
  year = {2022},
  month = {oct},
  note = {Talk 22709, see \url{https://scivideos.org/index.php/simons-institute/22709}}
}
          
Siva Theja Maguluri (Georgia Institute of Technology)
Talk number: 22709
Source Repository: Simons Institute

Abstract

In this work, we study policy-space methods for solving the reinforcement learning (RL) problem. We first consider the setting where the model is known (MDP setting) and show that the Natural Policy Gradient (NPG) algorithm (and several of its variants) converges linearly (geometrically) without the need for any regularization. In contrast to the optimization-based approaches used in the literature, our analysis is based on approximate dynamic programming. We then consider a variant with linear function approximation and establish its convergence guarantees. Finally, we consider the RL setting, where the model is unknown and a critic is deployed for policy evaluation. The critic is based on TD learning and uses off-policy sampling with linear function approximation, which leads to the infamous deadly-triad issue. We propose a generic algorithmic framework for single-time-scale multi-step TD learning with generalized importance sampling ratios, which enables us to overcome the high-variance issue in off-policy learning, and we establish its finite-sample guarantees. We show that this leads to an overall sample complexity of O(epsilon^-2).
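
To make the off-policy critic concrete, below is a minimal, self-contained sketch of multi-step TD learning with linear function approximation and truncated per-step importance sampling ratios. It illustrates the general idea only, not the specific algorithm from the talk; the 5-state random-walk MDP, the one-hot features, the truncation level, and all other parameters are illustrative assumptions chosen just to make the example runnable.

import numpy as np

# Sketch (not the talk's algorithm) of n-step off-policy TD learning with
# linear function approximation and truncated importance sampling ratios.
rng = np.random.default_rng(0)

n_states, gamma, alpha, n_step, rho_max = 5, 0.9, 0.05, 3, 2.0
actions = np.array([-1, +1])                     # move left / move right

target_pi = np.array([0.3, 0.7])                 # policy being evaluated
behavior_mu = np.array([0.5, 0.5])               # policy that generates data

def phi(s):
    """One-hot features, so linear function approximation is exact here."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

def step(s, a_idx):
    """Random-walk dynamics with reward +1 for being at the right edge."""
    s_next = int(np.clip(s + actions[a_idx], 0, n_states - 1))
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, r

w = np.zeros(n_states)                           # value weights: V(s) ~ phi(s) @ w
s = n_states // 2
window = []                                      # sliding n-step window

for t in range(50_000):
    a_idx = rng.choice(2, p=behavior_mu)         # act with the behavior policy
    s_next, r = step(s, a_idx)
    rho = min(target_pi[a_idx] / behavior_mu[a_idx], rho_max)  # truncated ratio
    window.append((s, r, s_next, rho))

    if len(window) == n_step:
        # Per-decision importance-sampled n-step return for the oldest state,
        # bootstrapping with the current linear value estimate.
        s0 = window[0][0]
        G, corr, disc = 0.0, 1.0, 1.0
        for (_, r_k, s_k1, rho_k) in window:
            corr *= rho_k
            G += disc * corr * r_k
            disc *= gamma
        G += disc * corr * (phi(window[-1][2]) @ w)
        w += alpha * (G - phi(s0) @ w) * phi(s0)  # single time-scale TD update
        window.pop(0)

    s = s_next

print("estimated values:", np.round(w, 3))

Truncating the ratios is used here as one simple instance of a "generalized" importance sampling ratio: it bounds the variance of the multi-step correction at the cost of some bias, which is the kind of trade-off the high-variance discussion in the abstract refers to.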