Zap Q-learning is a recent class of reinforcement learning algorithms, motivated primarily as a means to accelerate convergence. Stability theory has been absent outside of two restrictive classes: the tabular setting and optimal stopping. This paper introduces a new framework for the analysis of a more general class of recursive algorithms known as stochastic approximation. Based on this general theory, it is shown that Zap Q-learning is consistent under a non-degeneracy assumption, even when the function approximation architecture is nonlinear. Zap Q-learning with neural network function approximation emerges as a special case, and is tested on examples from OpenAI Gym. Based on multiple experiments with a range of neural network sizes, it is found that the new algorithms converge quickly and are robust to the choice of function approximation architecture.
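The abstract does not spell out the Zap update itself; as a rough illustration only, the sketch below implements a generic two-time-scale, matrix-gain (Zap-style) stochastic approximation recursion, in which a running Jacobian estimate supplies a Newton-Raphson-like gain. The function names, step-size exponents, and use of a pseudo-inverse are assumptions made for the sketch, not details taken from the paper.

import numpy as np

def zap_sa(f, jac_sample, theta0, n_iters=10000, rho=0.85, seed=0):
    """Minimal sketch of a Zap-style (matrix-gain) stochastic approximation loop.

    f(theta, rng)          -> noisy update direction, shape (d,)
    jac_sample(theta, rng) -> noisy sample of the Jacobian of the mean update, shape (d, d)

    The Jacobian estimate A_hat is updated on a faster timescale (gamma_n >> alpha_n),
    and theta moves in the Newton-Raphson-like direction -A_hat^{-1} f.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    d = theta.size
    A_hat = -np.eye(d)                 # initial gain -A_hat^{-1} = I gives plain SA
    for n in range(1, n_iters + 1):
        alpha = 1.0 / n                # slow step size for theta
        gamma = 1.0 / n ** rho         # faster step size for the Jacobian estimate
        A_hat += gamma * (jac_sample(theta, rng) - A_hat)
        # Pseudo-inverse guards against a temporarily singular Jacobian estimate.
        theta = theta - alpha * np.linalg.pinv(A_hat) @ f(theta, rng)
    return theta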
The problem of Offline Policy Evaluation (OPE) in Reinforcement Learning (RL) is a critical step towards applying RL in real-life applications. Existing work on OPE mostly focuses on evaluating a fixed target policy π, which does not provide useful bounds for offline policy learning, since π will then be data-dependent. We address this problem by simultaneously evaluating all policies in a policy class Π -- uniform convergence in OPE -- and obtain nearly optimal error bounds for a number of global / local policy classes. Our results imply that model-based planning achieves an optimal episode complexity of Õ(H^3 / (d_m ϵ^2)) in identifying an ϵ-optimal policy under the time-inhomogeneous episodic MDP model (H is the planning horizon and d_m is a quantity that reflects the exploration of the logging policy μ). To the best of our knowledge, this is the first time the optimal rate has been shown to be achievable in the offline RL setting, and this is the first paper to systematically investigate uniform convergence in OPE.
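As a concrete reference point for what model-based planning means in the tabular, time-inhomogeneous episodic setting, the following sketch builds plug-in estimates of the transition kernel and rewards from logged episodes and evaluates a target policy by backward induction. The names and the handling of unvisited state-action pairs are illustrative assumptions, not the paper's implementation.

import numpy as np

def evaluate_policy_model_based(episodes, pi, S, A, H):
    """Estimate a time-inhomogeneous tabular MDP from logged episodes and
    evaluate a deterministic target policy by backward induction.

    episodes: list of trajectories [(s_0, a_0, r_0), ..., (s_{H-1}, a_{H-1}, r_{H-1})]
    pi: array of shape (H, S) giving the action taken in state s at step h
    """
    counts = np.zeros((H, S, A, S))
    r_sum = np.zeros((H, S, A))
    n_sa = np.zeros((H, S, A))
    for traj in episodes:
        for h, (s, a, r) in enumerate(traj):
            n_sa[h, s, a] += 1
            r_sum[h, s, a] += r
            if h + 1 < H:
                s_next = traj[h + 1][0]
                counts[h, s, a, s_next] += 1
    # Plug-in estimates; unvisited pairs default to zero reward and a uniform next state.
    P_hat = np.where(n_sa[..., None] > 0, counts / np.maximum(n_sa[..., None], 1), 1.0 / S)
    r_hat = r_sum / np.maximum(n_sa, 1)
    # Backward induction for the value of pi under the estimated model.
    V = np.zeros(S)
    for h in reversed(range(H)):
        a_h = pi[h]
        V = r_hat[h, np.arange(S), a_h] + P_hat[h, np.arange(S), a_h] @ V
    return V    # estimated value of pi at each initial state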
In this talk I will discuss recent progress on a long-standing open problem in batch RL: learning Q* from an exploratory and polynomial-sized dataset, using a realizable and otherwise arbitrary function class. In fact, all existing algorithms demand function-approximation assumptions stronger than realizability, and the mounting negative evidence has led to a conjecture that sample-efficient learning is impossible in this setting (Chen and Jiang, 2019). Our algorithm, BVFT, breaks the hardness conjecture (albeit under a stronger notion of exploratory data) via a tournament procedure that reduces the learning problem to pairwise comparison, and solves the latter with the help of a state-action partition constructed from the compared functions. I will also discuss how BVFT can be applied to model selection / holdout validation among other extensions and open problems.
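BVFT is only outlined at a high level above; the following is a deliberately simplified sketch of the tournament idea for a finite candidate class and a dataset of discounted transitions (s, a, r, s'): each pairwise comparison discretizes the joint level sets of the two candidate functions, projects the Bellman backup onto functions that are piecewise constant over the resulting cells, and scores a candidate by its worst-case projected Bellman error. The binning scheme, the loss, and all names are assumptions of this sketch rather than the algorithm's actual specification.

import numpy as np

def pairwise_loss(f, g, data, actions, gamma, n_bins=20):
    """Projected Bellman error of candidate f, where the projection is defined by
    the jointly discretized level sets of (f, g) over the dataset (simplified sketch)."""
    s, a, r, s_next = data                       # arrays of equal length
    fv = f(s, a)
    gv = g(s, a)
    # Partition state-action pairs by jointly binning the values (f(s,a), g(s,a)).
    f_bins = np.digitize(fv, np.linspace(fv.min(), fv.max(), n_bins))
    g_bins = np.digitize(gv, np.linspace(gv.min(), gv.max(), n_bins))
    cells = f_bins * (n_bins + 2) + g_bins
    # Bellman backup of f at each transition (optimal-Bellman form, discounted).
    backup = r + gamma * np.max([f(s_next, np.full_like(a, b)) for b in actions], axis=0)
    # Piecewise-constant projection: replace the backup by its within-cell mean.
    proj = np.empty_like(backup, dtype=float)
    for c in np.unique(cells):
        mask = cells == c
        proj[mask] = backup[mask].mean()
    return float(np.mean((fv - proj) ** 2))

def bvft_select(candidates, data, actions, gamma):
    """Tournament: score each candidate by its worst pairwise loss, return the best."""
    scores = [max(pairwise_loss(f, g, data, actions, gamma) for g in candidates)
              for f in candidates]
    return candidates[int(np.argmin(scores))]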
Reinforcement Learning (RL) problems for continuous state and action space systems are among the most challenging in RL. Recently, deep reinforcement learning methods have been shown to be quite effective for certain RL problems with very large or continuous state and action spaces. But such methods require extensive hyper-parameter tuning, huge amounts of data, and come with no performance guarantees. We note that such methods are mostly trained 'offline' on experience replay buffers. In this talk, I will describe a series of simple reinforcement learning schemes for various settings. Our premise is that we have access to a generative model that can give us simulated samples of the next state. I will introduce the RANDPOL (randomized function approximation for policy iteration) algorithm, an empirical actor-critic algorithm that uses randomized neural networks and can successfully solve a challenging robotic problem with continuous state and action spaces. We also provide theoretical performance guarantees for the algorithm. Specifically, it allows for arbitrarily good approximation with high probability for any problem. I will also touch upon the probabilistic contraction analysis framework for iterative stochastic algorithms that underpins the theoretical analysis. This talk is based on work with Hiteshi Sharma (Microsoft).
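The abstract does not specify RANDPOL beyond "randomized neural networks"; as an illustration of the basic randomized-function-approximation ingredient, the sketch below uses a network with a randomly drawn, frozen hidden layer whose output weights are fit by least squares, the way a critic might be fit to sampled Bellman targets. All names, sizes, and the choice of activation are assumptions.

import numpy as np

class RandomizedNetwork:
    """Single hidden layer with random, frozen weights; only the linear output
    layer is trained (a common form of randomized function approximation)."""

    def __init__(self, in_dim, n_features=512, scale=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, scale, size=(in_dim, n_features))
        self.b = rng.uniform(-np.pi, np.pi, size=n_features)
        self.beta = np.zeros(n_features)

    def features(self, x):
        return np.tanh(x @ self.W + self.b)

    def fit(self, x, targets, reg=1e-3):
        """Ridge least-squares fit of the output weights to regression targets
        (e.g. sampled Bellman backups when the network is used as a critic)."""
        Phi = self.features(x)
        A = Phi.T @ Phi + reg * np.eye(Phi.shape[1])
        self.beta = np.linalg.solve(A, Phi.T @ targets)

    def predict(self, x):
        return self.features(x) @ self.beta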
In this talk we discuss computational approaches based on Monte Carlo sampling techniques for solving multistage stochastic programming problems. In some applications the considered programs have a periodical behavior. We demonstrate that in such cases it is possible to drastically reduce the number of stages by introducing a periodical analog of the so-called Bellman equations used in Markov Decision Processes and Stochastic Optimal Control. Furthermore, we describe a primal-dual variant of the Stochastic Dual Dynamic Programming algorithm, applied to the constructed periodical Bellman equations. We consider both risk neutral and risk averse settings.
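The periodical Bellman equations are not written out in the abstract; a schematic, risk-neutral version with period m, stagewise costs c_t, discount factor γ, feasible sets Y_t, and within-period uncertainty ξ_t might read as follows (the exact form used in the talk may differ):

V_t(x) \;=\; \mathbb{E}_{\xi_t}\Big[\, \min_{y \in \mathcal{Y}_t(x,\xi_t)} \big\{\, c_t(y,\xi_t) \;+\; \gamma\, V_{t+1}(y) \,\big\} \Big],
\qquad t = 1,\dots,m, \qquad V_{m+1} \equiv V_1,

so that only the m value functions V_1, ..., V_m need to be computed, regardless of the number of stages in the original program.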
Robust control theory highlighted the importance of quantifying model uncertainty for the design of feedback control strategies that achieve a provable level of performance. The robustness paradigm motivated work on ‘robust learning’ to address the question of how well model uncertainty can be characterized from data. The quality and convergence rate of model approximation from data imposes a fundamental limit on the rate of adaptation of the underlying control/decision strategy. In particular, for some model classes, sample complexity results impose significant limitations on such adaptation rates. We conjecture that the same limitations exist for convergence in reinforcement learning.
The characterization of the relationship between learning and model uncertainty hinges on having a tractable theory of model approximation. Such a theory exists for broad classes of linear stochastic dynamical systems. In this talk, I will present some results for learning classes of stochastic systems, namely jump linear systems. A key question for such models is unraveling the underlying model structure from data. In this context, I will show that spectral methods allow for estimating the underlying state dimension when the number of stochastic transitions is known. Utilizing existing results on model reduction using Hankel-like matrices, I will show that efficient learning of low-dimensional models is quite possible.
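As a rough illustration of the Hankel-based spectral idea (not the talk's actual estimator), the sketch below stacks estimated Markov parameters of a linear system into a block-Hankel matrix and reads off a state dimension from the decay of its singular values; the threshold rule and names are assumptions.

import numpy as np

def estimate_state_dimension(markov_params, k, tol=1e-2):
    """Return an estimated system order (and the singular values) from noisy
    Markov-parameter estimates M_i ~ C A^{i-1} B, via a block-Hankel matrix.

    markov_params: list of (p, m) arrays; requires len(markov_params) >= 2*k - 1.
    k: number of block rows and columns of the Hankel matrix.
    """
    H = np.block([[markov_params[i + j] for j in range(k)] for i in range(k)])
    svals = np.linalg.svd(H, compute_uv=False)
    # Model-reduction heuristic: count singular values above a fraction of the largest.
    order = int(np.sum(svals > tol * svals[0]))
    return order, svals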
This work is in collaboration with Tuhin Sarkar and Alexander Rakhlin.
The area of offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of (causal) sequential decision making strategies. The hope is that offline reinforcement learning coupled with function approximation methods (to deal with the curse of dimensionality) can provide a means to help alleviate the excessive sample complexity burden in modern sequential decision making problems. As such, the approach is becoming increasingly important to numerous areas of science, engineering, and technology. However, the extent to which this broader approach can be effective is not well understood; the existing literature largely consists of sufficient conditions.
This work focuses on the basic question of what are necessary representational and distributional conditions that permit provable sample-efficient offline reinforcement learning. Perhaps surprisingly, our main result shows that even if (i) we have realizability, in that the true value function of _every_ policy is linear in a given set of features, and (ii) our off-policy data has good coverage over all features (under a strong spectral condition), then any algorithm still (information-theoretically) requires a number of offline samples that is exponential in the problem horizon in order to non-trivially estimate the value of _any_ given policy. Our results highlight that sample-efficient offline policy evaluation is simply not possible unless significantly stronger conditions hold; such conditions include either low distribution shift (where the offline data distribution is close to the distribution of the policy to be evaluated) or significantly stronger representational conditions (beyond realizability).
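Stated schematically, and with constants, the exact spectral condition, and the precise exponent treated as placeholders rather than the paper's theorem:

\text{(i) realizability:}\quad \forall \pi:\;\; Q^{\pi}(s,a) \;=\; \langle \theta_{\pi},\, \phi(s,a) \rangle,
\qquad
\text{(ii) coverage:}\quad \sigma_{\min}\!\big(\mathbb{E}_{\mu}\big[\phi(s,a)\phi(s,a)^{\top}\big]\big) \;\ge\; \kappa \;>\; 0,

\text{and yet any algorithm requires } 2^{\Omega(H)} \text{ offline samples to non-trivially estimate the value of a given policy.}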
This is joint work with Ruosong Wang and Dean Foster.
A softmax operator applied to a set of values acts somewhat like the maximization function and somewhat like an average. In sequential decision making, softmax is often used in settings where it is necessary to maximize utility but also to hedge against problems that arise from putting all of one's weight behind a single maximum-utility decision. The Boltzmann softmax operator is the most commonly used softmax operator in this setting, but we show that this operator is prone to misbehavior. In this work, we study a differentiable softmax operator that, among other properties, is a non-expansion, ensuring convergent behavior in learning and planning. We introduce a variant of the SARSA algorithm that, by utilizing the new operator, computes a Boltzmann policy with a state-dependent temperature parameter. We show that the algorithm is convergent and that it performs favorably in practice. (With Kavosh Asadi.)
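The differentiable operator is not named in the abstract; assuming it is the mellowmax operator of Asadi and Littman, a minimal comparison with the Boltzmann softmax (with arbitrary temperature values) looks like this:

import numpy as np

def boltzmann_softmax(x, beta):
    """Boltzmann 'softmax' value: an exp(beta*x)-weighted average of x.
    This is the commonly used operator that is not a non-expansion."""
    w = np.exp(beta * (x - x.max()))             # shift for numerical stability
    return float(np.sum(w * x) / np.sum(w))

def mellowmax(x, omega):
    """Mellowmax value: log(mean(exp(omega*x))) / omega, a differentiable
    non-expansion that interpolates between the mean and the max of x."""
    c = x.max()                                  # shift for numerical stability
    return float(c + np.log(np.mean(np.exp(omega * (x - c)))) / omega)

q = np.array([1.0, 1.1, 3.0])
print(q.mean(), boltzmann_softmax(q, beta=5.0), mellowmax(q, omega=5.0), q.max())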
Policy gradient methods apply to complex, poorly understood control problems by performing stochastic gradient descent over a parameterized class of policies. Unfortunately, due to the multi-period nature of the objective, policy gradient algorithms face non-convex optimization problems and can get stuck in suboptimal local minima even for extremely simple problems. This talk will discuss structural properties, shared by several canonical control problems, that guarantee the policy gradient objective function has no suboptimal stationary points despite being non-convex. Time permitting, I'll then zoom in on the special case of state-aggregated policies and a proof showing that policy gradient converges to better policies than its relative, approximate policy iteration.
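As a minimal illustration of the parameterized-policy setup the talk starts from (a generic REINFORCE-style update on a tabular softmax policy, not the talk's specific control problems or its state-aggregated policy class), one might write:

import numpy as np

def reinforce_softmax(sample_episode, S, A, gamma=0.99, lr=0.1, n_iters=2000, seed=0):
    """Stochastic gradient ascent on the expected return for a tabular softmax policy.

    sample_episode(act, rng) -> list of (s, a, r) generated by following act(s).
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros((S, A))                     # policy logits

    def probs(s):
        p = np.exp(theta[s] - theta[s].max())
        return p / p.sum()

    for _ in range(n_iters):
        traj = sample_episode(lambda s: rng.choice(A, p=probs(s)), rng)
        G, returns = 0.0, []
        for (_, _, r) in reversed(traj):         # discounted returns-to-go
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for (s, a, _), G_t in zip(traj, returns):
            grad_logp = -probs(s)                # grad of log pi(a|s) w.r.t. theta[s]
            grad_logp[a] += 1.0
            theta[s] += lr * G_t * grad_logp     # REINFORCE gradient step
    return theta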