Nearly Minimax Optimal Reward-Free Reinforcement Learning
Simon Du (University of Washington)
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions. This framework has two phases. In the exploration phase, the agent collects trajectories by interacting with the environment without using any reward signal. In the planning phase, the agent needs to return a near-optimal policy for arbitrary reward functions. We give a new efficient algorithm which interacts with the environment for at most $O\left( \frac{S^2A}{\epsilon^2}\,\mathrm{poly}\log\left(\frac{SAH}{\epsilon}\right) \right)$ episodes in the exploration phase, and guarantees to output a near-optimal policy for arbitrary reward functions in the planning phase. Here, $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the planning horizon, and $\epsilon$ is the target accuracy relative to the total reward. Our sample complexity scales only logarithmically with $H$, in contrast to all existing results, which scale polynomially with $H$. Furthermore, this bound matches the minimax lower bound $\Omega\left(\frac{S^2A}{\epsilon^2}\right)$ up to logarithmic factors.
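As an illustration of the two-phase structure described above (a minimal sketch on a small tabular MDP, not the provably efficient algorithm from the talk; the uniformly random exploration policy and all sizes are placeholder assumptions), the exploration phase collects reward-free transition counts and the planning phase runs value iteration on the empirical model for whatever reward is supplied later:

```python
import numpy as np

# Minimal sketch of the reward-free RL template on a tabular MDP:
# explore without rewards, then plan for any reward revealed afterwards.
# The uniformly random exploration policy is only a placeholder.

S, A, H = 5, 3, 10          # states, actions, horizon (hypothetical sizes)
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # true transition kernel P[s, a] -> dist. over s'

def explore(num_episodes=2000):
    """Exploration phase: collect transition counts without seeing any reward."""
    counts = np.zeros((S, A, S))
    for _ in range(num_episodes):
        s = 0
        for _ in range(H):
            a = rng.integers(A)                      # placeholder exploration policy
            s_next = rng.choice(S, p=P[s, a])
            counts[s, a, s_next] += 1
            s = s_next
    visits = counts.sum(axis=2, keepdims=True)
    return np.where(visits > 0, counts / np.maximum(visits, 1), 1.0 / S)  # empirical model

def plan(P_hat, reward):
    """Planning phase: finite-horizon value iteration on the empirical model
    for an arbitrary reward function reward[s, a]."""
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = reward + P_hat @ V                       # Q[s, a] = r(s, a) + E[V(s')]
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy, V

P_hat = explore()
some_reward = rng.uniform(size=(S, A))               # reward revealed only after exploration
pi, V0 = plan(P_hat, some_reward)
print("estimated optimal value from s=0:", V0[0])
```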
QED-mediated plasma processes in compact objects: magnetic reconnection and beyond
Hayk Hakobyan (Princeton University)
In compact astrophysical objects, such as neutron star magnetospheres, black-hole accretion disk coronae, and jets, the main energy reservoir is the magnetic field. Plasma processes such as magnetic reconnection and turbulence govern the extraction of that energy, which is then deposited into heat and accelerated particles and, ultimately, the observed emission. To understand what we observe, we first need to describe from first principles how these processes operate in the violent regimes applicable to certain classes of compact objects, where radiative drag and pair production/annihilation play a significant role. As a specific example, I will briefly cover our state-of-the-art understanding of one of these processes, magnetic reconnection, present the first self-consistent simulations of QED-mediated reconnection applied to neutron star magnetospheres, and explain how they help us understand the observed gamma-ray emission from these objects. I will also talk about the future prospects of this area of research; QED-mediated plasma processes also take place in a variety of other astrophysical objects, such as the accretion disk coronae in X-ray binaries, coalescing neutron stars shortly before their merger, and short X-ray bursts in magnetars.
Statistical Efficiency in Offline Reinforcement Learning
Nathan Kallus (Cornell)
Offline RL is crucial in applications where experimentation is limited, such as medicine, but it is also notoriously difficult because the similarity between the observed trajectories and those generated by any proposed policy diminishes exponentially as the horizon grows, a phenomenon known as the curse of horizon. To better understand this limitation, we study the statistical efficiency limits of two central tasks in offline reinforcement learning: estimating the policy value and the policy gradient from off-policy data. The efficiency bounds reveal that the curse is generally insurmountable without assuming additional structure, and as such it plagues many standard estimators that work in general problems, but it may be overcome in Markovian settings and even further attenuated in stationary settings. We develop the first estimators achieving the efficiency limits in finite- and infinite-horizon MDPs using a meta-algorithm we term Double Reinforcement Learning (DRL). We provide favorable guarantees for DRL and for off-policy policy optimization via efficiently estimated policy gradient ascent.
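As background for the estimators discussed above, here is a minimal sketch of the generic doubly robust off-policy value estimator built from cumulative importance ratios and a fitted Q-function. It illustrates where the curse of horizon enters and is not the DRL estimator itself; the function names `pi_e` and `q_hat` and the trajectory format are assumptions made for the sketch.

```python
import numpy as np

def dr_policy_value(trajectories, pi_e, q_hat, n_actions, gamma=1.0):
    """Generic doubly robust off-policy value estimate for finite-horizon data.

    trajectories: list of episodes, each a list of (s, a, r, pi_b_prob) tuples
                  logged under a behavior policy; pi_b_prob = behavior prob. of a at s.
    pi_e(a, s):   evaluation-policy probability of action a in state s.
    q_hat(s, a, t): fitted Q-function estimate at step t (any regression model).

    The cumulative ratio rho makes the variance compound with the horizon (the
    "curse of horizon"); Markov/stationary settings replace it with marginal
    density ratios, which is where the efficiency gains discussed above come from.
    """
    values = []
    for traj in trajectories:
        rho, est = 1.0, 0.0           # rho = importance ratio of steps 0..t-1
        for t, (s, a, r, pb) in enumerate(traj):
            v_hat = sum(pi_e(b, s) * q_hat(s, b, t) for b in range(n_actions))
            est += gamma**t * rho * v_hat                   # model (baseline) term
            rho *= pi_e(a, s) / pb                          # now ratio of steps 0..t
            est += gamma**t * rho * (r - q_hat(s, a, t))    # importance-weighted correction
        values.append(est)
    return float(np.mean(values))
```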
Multiagent Reinforcement Learning: Rollout and Policy Iteration
Dimitri Bertsekas (ASU & MIT)
We discuss the solution of multistage decision problems using methods that are based on the idea of policy iteration (PI for short), i.e., starting from some base policy and generating an improved policy. Rollout is the simplest method of this type, where just one improved policy is generated. We can view PI as repeated application of rollout, where the rollout policy at each iteration serves as the base policy for the next iteration. In contrast with PI, rollout can be applied on-line and is suitable for on-line replanning. Moreover, rollout can use as base policy one of the policies produced by PI, thereby improving on that policy. This is the type of scheme underlying the prominently successful AlphaZero chess program. In this paper we focus on rollout and PI-like methods for multiagent problems, where the control consists of multiple components, each selected (conceptually) by a separate agent. We discuss an approach whereby, at every stage, the agents sequentially (one at a time) execute a local rollout algorithm that uses a base policy, together with some coordinating information from the other agents. The amount of total computation required at every stage grows linearly with the number of agents. By contrast, in the standard rollout algorithm, the amount of total computation grows exponentially with the number of agents. Despite the dramatic reduction in required computation, we show that our multiagent rollout algorithm has the fundamental cost improvement property of standard rollout: it guarantees improved performance relative to the base policy. We first develop our agent-by-agent policy improvement approach for finite-horizon problems, and then we extend it to exact and approximate PI for discounted and other infinite-horizon problems. We prove that the cost improvement property steers the algorithm towards convergence to an agent-by-agent optimal policy, thus establishing a connection with the theory of teams. We also discuss autonomous multiagent rollout schemes that allow the agents to make decisions autonomously through the use of precomputed signaling information, which is sufficient to maintain the cost improvement property, without any on-line coordination of control selection between the agents.
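To make the computational point concrete, the following schematic sketch (not the algorithm exactly as developed in the talk) contrasts joint rollout over the full joint action space with one-agent-at-a-time rollout; the helpers `q_value(state, joint_action)` and `base_policy(state)` are hypothetical stand-ins for a lookahead simulator and the base policy.

```python
import itertools

# Schematic comparison of standard (joint) rollout vs. one-agent-at-a-time rollout.
# q_value(state, joint_action) is assumed to return the one-step lookahead cost of
# applying joint_action and then following the base policy; base_policy(state)
# returns the base policy's joint action. Both are hypothetical stand-ins.

def joint_rollout(state, action_sets, q_value):
    """Standard rollout: minimize over the full joint action space.
    The search grows exponentially with the number of agents."""
    return min(itertools.product(*action_sets), key=lambda u: q_value(state, u))

def agent_by_agent_rollout(state, action_sets, q_value, base_policy):
    """Multiagent rollout: agents optimize one at a time, keeping already-chosen
    components fixed and using the base policy's components for agents not yet
    optimized. The search grows linearly with the number of agents."""
    u = list(base_policy(state))                 # start from the base policy's joint action
    for i, actions_i in enumerate(action_sets):  # agents decide sequentially
        best = min(actions_i, key=lambda a: q_value(state, tuple(u[:i] + [a] + u[i+1:])))
        u[i] = best                              # fix agent i's component and move on
    return tuple(u)
```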
Batch Policy Learning in Average Reward Markov Decision Processes
Peng Liao (Harvard)
We consider the batch (offline) policy learning problem in infinite-horizon Markov Decision Processes. Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward. We propose a doubly robust estimator for the average reward of a given policy and show that it achieves semiparametric efficiency. The proposed estimator requires estimation of two policy-dependent nuisance functions. We develop an optimization algorithm to compute the optimal policy within a parameterized stochastic policy class. The performance of the estimated policy is measured by the regret, i.e., the difference between the optimal average reward in the policy class and the average reward of the estimated policy, and we establish a finite-sample regret guarantee for our proposed method.
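For orientation, the average reward $\eta^{\pi}$ and the relative (differential) value function $Q^{\pi}$ satisfy the average-reward Bellman equation, and a doubly robust estimator augments a plug-in fit with a density-ratio-weighted Bellman residual. The display below is a sketch of this general structure only; the normalization, nuisance estimation, and cross-fitting details of the estimator in the talk may differ.

$$ Q^{\pi}(s,a) \;=\; r(s,a) - \eta^{\pi} + \mathbb{E}\big[\, Q^{\pi}(S', A') \,\big|\, S=s,\, A=a \,\big], \qquad A' \sim \pi(\cdot \mid S'), $$

$$ \hat{\eta}^{\pi} \;=\; \frac{\sum_{i} \hat{\omega}^{\pi}(S_i, A_i)\,\big\{ R_i + \hat{Q}^{\pi}(S_{i+1}, \pi) - \hat{Q}^{\pi}(S_i, A_i) \big\}}{\sum_{i} \hat{\omega}^{\pi}(S_i, A_i)}, $$

where $\hat{Q}^{\pi}(s,\pi) = \sum_a \pi(a \mid s)\,\hat{Q}^{\pi}(s,a)$ and $\hat{\omega}^{\pi}$ estimates the ratio of the stationary state-action distribution under $\pi$ to that of the batch data; a schematic estimator of this form remains consistent if either nuisance function is estimated correctly.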
The Fascinating, Weird World of Quantum Matter
PIRSA:20120032
In her December 2 Perimeter Public Lecture webcast, Hallberg will explore examples of emergent phenomena and demonstrate how we can tackle these problems using quantum information to filter the most relevant data. By advancing research in this field, we hope to seed advances with applications ranging from medical equipment and new materials to efficient energy generation, transportation, and storage.
Challenges of (asymptotically safe) quantum gravity
Marc Schiffer (Radboud Universiteit Nijmegen)
I will discuss challenges of quantum gravity, highlighting conceptual, methodological, and phenomenological aspects. Focusing on asymptotically safe quantum gravity, I will review recent progress in addressing key theoretical challenges using continuum and lattice methods. Furthermore, I will explain how the high predictive power of the asymptotically safe fixed point for quantum gravity and matter might allow us to explain fundamental properties of our universe, for example its dimensionality. Finally, I will point out possible connections between different approaches to quantum gravity, and how these connections might help us formulate a fundamental description of nature.
The geometry of string compactifications
Lara Anderson (Virginia Polytechnic Institute and State University)
In string compactifications the roles of physics and geometry are intrinsically intertwined. While the goals of the resulting 4-dimensional effective theories are physical, the path to those answers frequently leads to cutting-edge challenges in modern mathematics. In this talk, I will describe recent progress in characterizing the geometry of Calabi-Yau manifolds in terms of their description as elliptic fibrations. This description has remarkable consequences for the form of the string vacuum space and the properties of string effective theories, including particle masses and couplings.
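As background on the elliptic-fibration description mentioned above (a standard fact about elliptic fibrations with section, not a result specific to this talk), such a Calabi-Yau manifold $\pi: X \to B$ can be written, at least birationally, as a Weierstrass model over the base $B$:

$$ y^2 = x^3 + f\,x + g, \qquad f \in H^0\!\left(B, K_B^{-4}\right), \quad g \in H^0\!\left(B, K_B^{-6}\right), $$

with the elliptic fiber degenerating over the discriminant locus $\Delta = 4f^3 + 27g^2 = 0$; the choice of $f$ and $g$, together with the resolution of singular fibers, controls much of the resulting effective physics.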
The Mean-Squared Error of Double Q-Learning
R. Srikant (University of Illinois at Urbana-Champaign)
We establish a theoretical comparison between the asymptotic mean-squared error of Double Q-learning and that of Q-learning. Using prior work on the asymptotic mean-squared error of linear stochastic approximation based on Lyapunov equations, we show that the asymptotic mean-squared error of Double Q-learning is exactly equal to that of Q-learning if Double Q-learning uses twice the learning rate of Q-learning and outputs the average of its two estimators. We also present some practical implications of this theoretical observation using simulations.
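To fix notation for the comparison described above, here is a minimal tabular sketch of Double Q-learning with the averaged output and doubled step size; the toy MDP, the simplified global step size, and the exploration scheme are assumptions for illustration, not the setting analyzed in the talk.

```python
import numpy as np

# Minimal tabular Double Q-learning sketch on a made-up MDP. Two tables QA, QB;
# at each step one is updated at random, using the other to evaluate its greedy
# action; the reported estimate is their average. A proper implementation would
# use per-(s, a) visit counts for the step size; a global count is used for brevity.

rng = np.random.default_rng(1)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))    # hypothetical transition kernel
R = rng.uniform(size=(S, A))                  # hypothetical rewards

QA = np.zeros((S, A))
QB = np.zeros((S, A))
s = 0
for n in range(1, 50_000):
    a = rng.integers(A) if rng.random() < 0.1 else int((QA + QB)[s].argmax())
    s_next = rng.choice(S, p=P[s, a])
    r = R[s, a]
    alpha = 2.0 / (n + 1)                     # "twice the learning rate" of a 1/n Q-learning schedule
    if rng.random() < 0.5:                    # update QA, using QB to evaluate QA's greedy action
        a_star = int(QA[s_next].argmax())
        QA[s, a] += alpha * (r + gamma * QB[s_next, a_star] - QA[s, a])
    else:                                     # symmetric update of QB
        a_star = int(QB[s_next].argmax())
        QB[s, a] += alpha * (r + gamma * QA[s_next, a_star] - QB[s, a])
    s = s_next

Q_est = 0.5 * (QA + QB)                       # averaged output, as in the comparison above
print(Q_est)
```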
Q-learning with Uniformly Bounded Variance
Adithya Devraj (Stanford)
Sample complexity bounds are a common performance metric in the RL literature. In the discounted-cost, infinite-horizon setting, all of the known bounds can be arbitrarily large as the discount factor approaches unity. For a large discount factor, these bounds seem to imply that a very large number of samples is required to achieve an epsilon-optimal policy. In this talk, we will discuss a new class of algorithms that have sample complexity uniformly bounded over all discount factors. One may argue that this is impossible, due to a recent minimax lower bound. The explanation is that this previous lower bound is for a specific problem, which we modify, without compromising the ultimate objective of obtaining an epsilon-optimal policy. Specifically, we show that the asymptotic covariance of the Q-learning algorithm with an optimized step-size sequence is a quadratic function of a factor that diverges as the discount factor approaches 1; this is an expected, and essentially known, result. The new relative Q-learning algorithm proposed here is shown to have an asymptotic covariance that is uniformly bounded over all discount factors.
Zap Q-learning with Nonlinear Function Approximation
Sean Meyn (University of Florida)
Zap Q-learning is a recent class of reinforcement learning algorithms, motivated primarily as a means to accelerate convergence. Stability theory has been absent outside of two restrictive classes: the tabular setting and optimal stopping. This paper introduces a new framework for the analysis of a more general class of recursive algorithms known as stochastic approximation. Based on this general theory, it is shown that Zap Q-learning is consistent under a non-degeneracy assumption, even when the function approximation architecture is nonlinear. Zap Q-learning with neural network function approximation emerges as a special case, and is tested on examples from OpenAI Gym. Based on multiple experiments with a range of neural network sizes, it is found that the new algorithms converge quickly and are robust to the choice of function approximation architecture.
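For readers unfamiliar with the method, the following is a schematic of the matrix-gain (Zap) stochastic approximation recursion that Zap Q-learning builds on, written for the tabular case for brevity; the step-size exponents, the pseudo-inverse, the Jacobian initialization, and the random behavior policy are illustrative assumptions, not the paper's prescriptions.

```python
import numpy as np

# Schematic Zap-style (matrix-gain) stochastic approximation for tabular Q-learning.
# Two coupled recursions on different time scales: a running estimate A_hat of the
# mean "Jacobian" of the TD term, and a parameter update preconditioned by -A_hat^{-1}.

rng = np.random.default_rng(2)
S, A, gamma = 4, 2, 0.95
d = S * A
P = rng.dirichlet(np.ones(S), size=(S, A))     # hypothetical MDP
R = rng.uniform(size=(S, A))

def feat(s, a):
    e = np.zeros(d)
    e[s * A + a] = 1.0                          # indicator (tabular) feature vector
    return e

theta = np.zeros(d)                             # Q(s, a) = theta[s * A + a]
A_hat = -np.eye(d)                              # crude initialization of the Jacobian estimate
s = 0
for n in range(1, 50_000):
    a = rng.integers(A)                         # exploratory behavior policy (illustrative)
    s_next = rng.choice(S, p=P[s, a])
    q_next = theta.reshape(S, A)[s_next]
    a_star = int(q_next.argmax())

    psi = feat(s, a)
    td = R[s, a] + gamma * q_next[a_star] - theta[s * A + a]
    f_n = psi * td                              # TD(0)-style update direction

    # Faster time scale: track the mean of the random Jacobian of f w.r.t. theta.
    A_n = np.outer(psi, gamma * feat(s_next, a_star) - psi)
    beta = (n + 1) ** -0.85
    A_hat += beta * (A_n - A_hat)

    # Slower time scale: Newton-Raphson-like step with matrix gain -A_hat^{-1}.
    alpha = 1.0 / (n + 1)
    theta -= alpha * np.linalg.pinv(A_hat) @ f_n

    s = s_next

print(theta.reshape(S, A))
```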
Special Topics in Astrophysics - Numerical Hydrodynamics - Lecture 22
Daniel Siegel (University of Greifswald)