Format results
Stable Reinforcement Learning with Unbounded State Space
Qiaomin Xie (Cornell)Special Topics in Astrophysics - Numerical Hydrodynamics - Lecture 23
Daniel Siegel University of Greifswald
Quantum black holes without strings
Gerard 't Hooft Utrecht University
q-Opers, QQ-Systems, and Bethe Ansatz
Peter Koroteev University of California, Berkeley
Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation
Devavrat Shah (MIT)Nearly Minimax Optimal Reward-Free Reinforcement Learning
Simon Du (University of Washington)Statistical Efficiency in Offline Reinforcement Learning
Nathan Kallus (Cornell)Multiagent Reinforcement Learning: Rollout and Policy Iteration
Dimitri Bertsekas (ASU & MIT)Batch Policy Learning in Average Reward Markov Decision Processes
Peng Liao (Harvard)
Convergence and Sample Complexity of Gradient Methods for the Model-Free Linear Quadratic Regulator Problem
Mihailo Jovanovic (USC)Model-free reinforcement learning attempts to find an optimal control action for an unknown dynamical system by directly searching over the parameter space of controllers. The convergence behavior and statistical properties of these approaches are often poorly understood because of the nonconvex nature of the underlying optimization problems and the lack of exact gradient computation. In this talk, we discuss performance and efficiency of such methods by focusing on the standard infinite-horizon linear quadratic regulator problem for continuous-time systems with unknown state-space parameters. We establish exponential stability for the ordinary differential equation (ODE) that governs the gradient-flow dynamics over the set of stabilizing feedback gains and show that a similar result holds for the gradient descent method that arises from the forward Euler discretization of the corresponding ODE. We also provide theoretical bounds on the convergence rate and sample complexity of the random search method with two-point gradient estimates. We prove that the required simulation time for achieving $\epsilon$-accuracy in the model-free setup and the total number of function evaluations both scale as $\log (1/\epsilon)$.Policy Evaluation under Interference
Stefan Wager (Stanford)Most literature on policy evaluation, bandit methods, etc., is focused on settings where actions taken on one unit do not affect other units. Such lack of interference, however, fails to hold in many applications of interest. For example, in a vaccine study, one person getting vaccinated also protects others; in a microcredit study, loans given to one person may stimulate the economy and indirectly benefit others; or, in a jobs-training study, training more people to perform a given task may create over-supply of qualified workers, thus reducing the market value of the training. In this talk, I'll survey various approaches to modeling cross-unit interference, and discuss associated methods for policy evaluation.Stable Reinforcement Learning with Unbounded State Space
Qiaomin Xie (Cornell)We consider the problem of reinforcement learning (RL) with unbounded state space, motivated by the classical problem of scheduling in a queueing network. Traditional policies as well as error metric that are designed for finite, bounded or compact state space, require infinite samples for providing any meaningful performance guarantee (e.g. ℓ_∞ error) for unbounded state space. We need a new notion of performance metric. Inspired by the literature in queuing systems and control theory, we propose stability as the notion of “goodness”: the state dynamics under the policy should remain in a bounded region with high probability. As a proof of concept, we propose an RL policy using Sparse-Sampling-based Monte Carlo Oracle and argue that it satisfies the stability property as long as the system dynamics under the optimal policy respects a Lyapunov function. The assumption of existence of a Lyapunov function is not restrictive as it is equivalent to the positive recurrence or stability property of any Markov chain. And, our policy does not utilize the knowledge of the specific Lyapunov function. To make our method sample efficient, we provide an improved, sample efficient Monte Carlo Oracle with Lipschitz value. We also design an adaptive version of the algorithm, based on carefully constructed statistical tests, which finds the correct tuning parameter automatically. This work is joint with Devavrat Shah and Zhi Xu.Special Topics in Astrophysics - Numerical Hydrodynamics - Lecture 23
Daniel Siegel University of Greifswald
Quantum black holes without strings
Gerard 't Hooft Utrecht University
The quantum states of matter in the immediate vicinity of a black hole can be studied using no other information than Standard Model physics combined with perturbative gravity. The point is that the relevant energy scale of the most important fields involved is low compared to the Planck scale, provided the black hole is big compared to the Planck scale. Usually this problem is investigated by using the metric that includes the effects of matter that formed the black hole in the distant past and sometimes also matter that is radiated away in the distant future. Arguments are presented however to justify that one should ignore those effects. The metric then becomes invariant under time translation, and this is what we need to get the energy eigen states. It is this scheme that forces us to impose the antipodal identification as a new boundary condition, giving us a beautiful picture of black hole quantum evolution. The logic of choosing this compulsory topological twist in space-time is explained. There are still many very hard questions and I hope to be able to inspire people to look into these.
q-Opers, QQ-Systems, and Bethe Ansatz
Peter Koroteev University of California, Berkeley
We introduce the notions of (G,q)-opers and Miura (G,q)-opers, where G is a simply-connected complex simple Lie group, and prove some general results about their structure. We then establish a one-to-one correspondence between the set of (G,q)-opers of a certain kind and the set of nondegenerate solutions of a system of XXZ Bethe Ansatz equations. This can be viewed as a generalization of the so-called quantum/classical duality which I studied with D. Gaiotto several years ago. q-Opers generalize classical side, while on the quantum side we have more general XXZ Bethe Ansatz equations. The generalization goes beyond the scope of physics of N=2 supersymmetric gauge theories.
Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation
Devavrat Shah (MIT)We consider the question of learning Q-function in a sample efficient manner for reinforcement learning with continuous state and action spaces under a generative model. If Q-function is Lipschitz continuous, then the minimal sample complexity for estimating ϵ-optimal Q-function is known to scale as O(eps^{-(d1+d2+2)}) per classical non-parametric learning theory, where d1 and d2 denote the dimensions of the state and action spaces respectively. The Q-function, when viewed as a kernel, induces a Hilbert-Schmidt operator and hence possesses square-summable spectrum. This motivates us to consider a parametric class of Q-functions parameterized by its "rank" r, which contains all Lipschitz Q-functions as r→∞. As our key contribution, we develop a simple, iterative learning algorithm that finds ϵ-optimal Q-function with sample complexity of O(eps^{-max(d1,d2)+2}) when the optimal Q-function has low rank r and the discounting factor is below a certain threshold. Thus, this provides an exponential improvement in sample complexity. To enable our result, we develop a novel Matrix Estimation algorithm that faithfully estimates an unknown low-rank matrix with respect to max-norm sense even in the presence of arbitrary bounded noise, which might be of interest in its own right. Empirical results on several stochastic control tasks confirm the efficacy of our "low-rank" algorithms. This is based on joint work with Dogyoon Song, Zhi Xu, Yuzhe Yang. The manuscript is available at https://arxiv.org/abs/2006.06135.Nearly Minimax Optimal Reward-Free Reinforcement Learning
Simon Du (University of Washington)We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions. This framework has two phases. In the exploration phase, the agent collects trajectories by interacting with the environment without using any reward signal. In the planning phase, the agent needs to return a near-optimal policy for arbitrary reward functions. We give a new efficient algorithm which interacts with the environment at most $O\left( \frac{S^2A}{\epsilon^2}\poly\log\left(\frac{SAH}{\epsilon}\right) \right)$ episodes in the exploration phase, and guarantees to output a near-optimal policy for arbitrary reward functions in the planning phase. Here, $S$ is the size of state space, $A$ is the size of action space, $H$ is the planning horizon, and $\epsilon$ is the target accuracy relative to the total reward. Our sample complexity scales only logarithmically with $H$, in contrast to all existing results which scale polynomially with $H$. Furthermore, this bound matches the minimax lower bound $\Omega\left(\frac{S^2A}{\epsilon^2}\right)$ up to logarithmic factors.QED-mediated plasma processes in compact objects: magnetic reconnection and beyond
Hayk Hakobyan Princeton University
In compact astrophysical objects, such as neutron star magnetospheres, black-hole accretion disk coronae and jets, the main energy reservoir is the magnetic field. The plasma processes such as magnetic reconnection and turbulence govern the extraction of that energy, which is then deposited into heat and accelerated particles and, ultimately, the observed emission. To understand what we observe, we first need to describe from first principles how these processes operate in violent regimes applicable to certain classes of compact objects, where radiative drag and pair production/annihilation play a significant role. As a specific example, I will briefly cover our state-of-the-art understanding of one of these processes — magnetic reconnection — and present the first self-consistent simulations of QED-mediated reconnection in application to neutron star magnetospheres and explain how it helps us understand the observed gamma-ray emission from these objects. I will also talk about the future prospects of this area of research; QED-mediated plasma processes also take place in a variety of other astrophysical objects, such as the accretion disk coronae in X-ray binaries, coalescing neutron stars shortly before their merger, and short X-ray bursts in magnetars.
Statistical Efficiency in Offline Reinforcement Learning
Nathan Kallus (Cornell)Offline RL is crucial in applications where experimentation is limited, such as medicine, but it is also notoriously difficult because the similarity between the trajectories observed and those generated by any proposed policy diminishes exponentially as horizon grows, known as the curse of horizon. To better understand this limitation, we study the statistical efficiency limits of two central tasks in offline reinforcement learning: estimating the policy value and the policy gradient from off-policy data. The efficiency bounds reveal that the curse is generally insurmountable without assuming additional structure and as such plagues many standard estimators that work in general problems, but it may be overcome in Markovian settings and even further attenuated in stationary settings. We develop the first estimators achieving the efficiency limits in finite- and infinite-horizon MDPs using a meta-algorithm we term Double Reinforcement Learning (DRL). We provide favorable guarantees for DRL and for off-policy policy optimization via efficiently-estimated policy gradient ascent.Multiagent Reinforcement Learning: Rollout and Policy Iteration
Dimitri Bertsekas (ASU & MIT)We discuss the solution of multistage decision problems using methods that are based on the idea of policy iteration (PI for short), i.e., start from some base policy and generate an improved policy. Rollout is the simplest method of this type, where just one improved policy is generated. We can view PI as repeated application of rollout, where the rollout policy at each iteration serves as the base policy for the next iteration. In contrast with PI, rollout can be applied on-line and is suitable for on-line replanning. Moreover, rollout can use as base policy one of the policies produced by PI, thereby improving on that policy. This is the type of scheme underlying the prominently successful AlphaZero chess program. In this paper we focus on rollout and PI-like methods for multiagent problems, where the control consists of multiple components each selected (conceptually) by a separate agent. We discuss an approach, whereby at every stage, the agents sequentially (one-at-a-time) execute a local rollout algorithm that uses a base policy, together with some coordinating information from the other agents. The amount of total computation required at every stage grows linearly with the number of agents. By contrast, in the standard rollout algorithm, the amount of total computation grows exponentially with the number of agents. Despite the dramatic reduction in required computation, we show that our multiagent rollout algorithm has the fundamental cost improvement property of standard rollout: it guarantees an improved performance relative to the base policy. We first develop our agent-by-agent policy improvement approach for finite horizon problems, and then we extend it to exact and approximate PI for discounted and other infinite horizon problems. We prove that the cost improvement property steers the algorithm towards convergence to an agent-by-agent optimal policy, thus establishing a connection with the theory of teams. We also discuss autonomous multiagent rollout schemes that allow the agents to make decisions autonomously through the use of precomputed signaling information, which is sufficient to maintain the cost improvement property, without any on-line coordination of control selection between the agents.Batch Policy Learning in Average Reward Markov Decision Processes
Peng Liao (Harvard)We consider the batch (off-line) policy learning problem in the infinite horizon Markov Decision Process. Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward. We propose a doubly robust estimator for the average reward for a given policy and show that it achieves semi-parametric efficiency. The proposed estimator requires estimation of two policy dependent nuisance functions. We develop an optimization algorithm to compute the optimal policy in a parameterized stochastic policy class. The performance of the estimated policy is measured by the regret, i.e., the difference between the optimal average reward in the policy class and the average reward of the estimated policy and we establish a finite-sample regret guarantee for our proposed method.