Observations indicate the existence of natural particle accelerators in the Milky Way, capable of producing PeV cosmic rays (“PeVatrons”). Observations also indicate the existence of extreme sources in the Milky Way, capable of producing gamma-ray emission above 100 TeV. If these gamma-ray sources are hadronic cosmic-ray accelerators, then they must also be neutrino sources. However, no neutrino sources have been detected. How can we consistently understand the observations of cosmic rays, gamma rays, and neutrinos? We point out that two extreme scenarios are allowed: (1) the hadronic cosmic-ray accelerators and the gamma-ray sources are the same objects, so that neutrino sources exist and improved telescopes can detect them, versus (2) the hadronic cosmic-ray accelerators and the gamma-ray sources are distinct, so that there are no detectable neutrino sources. We discuss the nature of the Milky Way’s highest-energy gamma-ray sources and outline future prospects toward understanding the origin of hadronic cosmic rays.
A fundamental challenge in interactive learning and decision making, ranging from bandit problems to reinforcement learning, is to provide sample-efficient, adaptive learning algorithms that achieve near-optimal regret. This problem is analogous to the classical problem of optimal (supervised) statistical learning, where well-known complexity measures (e.g., VC dimension and Rademacher complexity) govern the statistical complexity of learning. However, characterizing the statistical complexity of interactive learning is substantially more challenging due to the adaptive nature of the problem. In this talk, we will introduce a new complexity measure, the Decision-Estimation Coefficient, which is necessary and sufficient for sample-efficient interactive learning. In particular, we will provide: (1) a lower bound on the optimal regret for any interactive decision making problem, establishing the Decision-Estimation Coefficient as a fundamental limit; and (2) a unified algorithm design principle, Estimation-to-Decisions, which attains a regret bound matching our lower bound, thereby achieving optimal sample-efficient learning as characterized by the Decision-Estimation Coefficient. Taken together, these results give a theory of learnability for interactive decision making. When applied to reinforcement learning settings, the Decision-Estimation Coefficient recovers essentially all existing hardness results and lower bounds.
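For readers unfamiliar with the quantity, one commonly stated form of the Decision-Estimation Coefficient is sketched below (following the Foster–Kakade–Qian–Rakhlin formulation; here $\mathcal{M}$ is the model class, $\overline{M}$ a reference model produced by an estimation oracle, $\Pi$ the decision space, $f^{M}(\pi)$ the mean reward of decision $\pi$ under model $M$, $\pi_{M}$ an optimal decision for $M$, and $D^{2}_{\mathrm{H}}$ the squared Hellinger distance):

```latex
\[
\mathsf{dec}_{\gamma}(\mathcal{M},\overline{M})
  \;=\; \inf_{p \in \Delta(\Pi)} \, \sup_{M \in \mathcal{M}} \,
    \mathbb{E}_{\pi \sim p}\!\left[
      f^{M}(\pi_{M}) - f^{M}(\pi)
      \;-\; \gamma \, D^{2}_{\mathrm{H}}\!\big(M(\pi),\, \overline{M}(\pi)\big)
    \right]
\]
```

Intuitively, the infimum player trades off the regret of its randomized decision against the information (measured in Hellinger distance) that the decision provides for distinguishing the true model from the reference model.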
Weak gravitational lensing by the intervening large-scale structure (LSS) of the Universe is the leading non-linear effect on the anisotropies of the cosmic microwave background (CMB). The integrated line-of-sight gravitational potential that causes the distortion can be reconstructed from the lensed temperature and polarization anisotropies via estimators quadratic in the CMB modes. While previous studies have focused on the lensing power spectrum, upcoming experiments will be sensitive to the bispectrum of the lensing field, sourced by the non-linear evolution of structure. The detection of such a signal would provide additional information on late-time cosmological evolution, complementary to the power spectrum.
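For concreteness, the flat-sky temperature-only quadratic estimator takes the schematic form below (a standard sketch, not specific to this work; $\tilde{T}$ denotes the lensed temperature multipoles, $g$ a weight function optimized for minimum variance, and $A_{L}$ a normalization chosen so the estimator is unbiased):

```latex
\[
\hat{\phi}(\mathbf{L})
  \;=\; A_{L} \int \frac{d^{2}\boldsymbol{\ell}}{(2\pi)^{2}}\,
        g(\boldsymbol{\ell}, \mathbf{L})\,
        \tilde{T}(\boldsymbol{\ell})\,\tilde{T}(\mathbf{L} - \boldsymbol{\ell})
\]
```

The bispectrum of the lensing field can then be probed by correlating triplets of such reconstructions, with the corresponding reconstruction-noise biases handled in the analysis.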
Reinforcement learning (RL) is a learning paradigm for large-scale sequential decision making problems in complex stochastic systems. Many modern RL algorithms solve the underlying Bellman fixed point equation using Stochastic Approximation (SA). This two-part tutorial presents an overview of our results on SA and illustrates how they can be used to obtain sample complexity results for a large class of RL algorithms.
Part I of the tutorial focuses on SA, a popular approach for solving fixed point equations when the information is corrupted by noise. We consider a class of SA algorithms for operators that are contractive under arbitrary norms (especially the l-infinity norm). We present finite-sample bounds on the mean square error, which are established using a Lyapunov framework based on infimal convolution and the generalized Moreau envelope. We then present our recent result on exponential concentration of the tail error, even when the iterates are not bounded by a constant. These tail bounds are obtained using exponential supermartingales in conjunction with the Moreau envelope and bootstrapping.
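As a minimal, self-contained illustration of the algorithmic setting (not of the Lyapunov analysis), the sketch below runs the basic SA recursion x_{k+1} = x_k + alpha_k (F(x_k) + w_k - x_k) for a toy operator that is a contraction in the l-infinity norm; the operator, noise model, and step sizes are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def F(x):
    # Toy operator: a 0.9-contraction in the l-infinity norm (clipping is
    # 1-Lipschitz in the sup norm). Hypothetical choice for illustration.
    return 0.9 * np.clip(x, -1.0, 1.0) + np.array([0.5, -0.3])

x = np.zeros(2)
for k in range(1, 20001):
    alpha = 10.0 / (k + 100)                 # diminishing step sizes
    noise = rng.normal(scale=0.1, size=2)    # zero-mean measurement noise
    # SA update: x_{k+1} = x_k + alpha_k * (F(x_k) + w_k - x_k)
    x = x + alpha * (F(x) + noise - x)

print("SA iterate:       ", x)
print("true fixed point: ", [1.4, -1.2])     # solved by hand for this F
```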
Part II of the tutorial focuses on RL. We briefly illustrate the connection between RL algorithms and SA of contractive operators, and highlight the importance of the infinity norm. We then exploit the results from Part I to present finite-sample bounds for various RL algorithms, including on-policy and off-policy algorithms, in both tabular and linear function approximation settings.
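To make the RL-to-SA connection concrete, the toy example below (our own illustration, not code from the tutorial) runs tabular Q-learning on a hypothetical two-state MDP; the asynchronous update is exactly an SA iteration for the Bellman optimality operator, which is a gamma-contraction in the l-infinity norm.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-state, 2-action MDP (illustration only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s, a, s']
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0],                   # R[s, a]
              [0.5, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
s = 0
for k in range(1, 200001):
    a = rng.integers(2)                     # uniform exploration
    s_next = rng.choice(2, p=P[s, a])
    alpha = 50.0 / (k + 1000)               # diminishing step size
    # SA update for the Bellman optimality operator (sup-norm contraction):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

# Compare with value iteration on the known model.
Q_star = np.zeros((2, 2))
for _ in range(1000):
    Q_star = R + gamma * P @ Q_star.max(axis=1)

print("Q-learning estimate:\n", Q)
print("Q* from value iteration:\n", Q_star)
```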
In today's computing systems, there is a strong tension between achieving high server utilization and accommodating the time-varying resource requirements of jobs. Motivated by this problem, we study a stochastic bin packing formulation with time-varying item sizes, where bins and items correspond to servers and jobs, respectively. Our goal is to answer the following fundamental question: How can we minimize the number of active servers (servers running at least one job) given a budget for the cost associated with resource overcommitment on servers? We propose a novel framework for designing job dispatching policies, which reduces the problem to a policy design problem in a single-server system through policy conversions. Through this framework, we develop a policy that is asymptotically optimal as the job arrival rate increases. This is joint work with Yige Hong at Carnegie Mellon University and Qiaomin Xie at the University of Wisconsin–Madison.
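To fix ideas, here is a deliberately simple first-fit baseline (our own toy sketch with hypothetical parameters, not the policy-conversion framework or the asymptotically optimal policy from the talk): it packs each job onto an active server as long as the server's total mean demand respects an overcommitment budget, and opens a new server otherwise.

```python
import random

CAPACITY = 1.0
OVERCOMMIT_BUDGET = 0.2         # allowed relative overcommitment in mean demand

servers = []                    # each server = list of mean demands of its jobs

def dispatch(mean_demand):
    """First-fit over active servers, subject to an overcommitment limit."""
    limit = CAPACITY * (1.0 + OVERCOMMIT_BUDGET)
    for jobs in servers:
        if sum(jobs) + mean_demand <= limit:
            jobs.append(mean_demand)
            return
    servers.append([mean_demand])   # open a new server

random.seed(0)
for _ in range(1000):
    dispatch(random.uniform(0.05, 0.4))    # jobs with random mean demands

print("active servers:", len(servers))
```

This baseline ignores job departures and the time-varying nature of demands; it is only meant to make the "active servers versus overcommitment budget" trade-off tangible.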
Many problems in online decision-making can be viewed through a lens of exchangeable actions: over a long time horizon of length T, some arrival process allows for many opportunities to take a particular action, but the exact timing of actions is irrelevant for the eventual outcome. Examples of this phenomenon include bin packing (where it does not matter when we put an item of a given size into a given bin), knapsack (where it does not matter when we accept an item of a given value), and a range of matching problems (where it does not matter when we assign a request type to a given resource), among others. In this talk, we survey a number of results that capture the conditions under which such problems give rise to algorithms with uniform loss guarantees, i.e., where the additive loss relative to an optimal solution in hindsight is bounded independently of the horizon length. Our conditions characterize permissible geometries of the objective function, minimal requirements on the arrival process, and uncertainty about the underlying arrival processes, including the length of the time horizon. This talk is based on joint work with Jiayu (Kamessi) Zhao, Chamsi Hssaine, Sid Banerjee, Thodoris Lykouris, Wentao Weng, Elisabeth Paulson, and Brad Sturt.
Today’s computing systems consist of servers that offer highly heterogeneous service rates due to the diverse machine types and the dynamic computing environment. In such systems, to guarantee low latency, load-balancing algorithms need to consider not only the amount of work on servers but also their service rates. However, because of the dynamic nature of the computing environment, the service rates cannot be obtained beforehand and thus have to be learned on the fly. In this talk, we consider a load-balancing system that consists of multiple servers with unknown service rates. Each incoming job needs to be assigned to one of the servers upon arrival. Our goal is to develop learning-integrated job-assignment algorithms that perform comparably to their counterparts under known service rates. Although some facets of this problem resemble the well-studied multi-armed bandit problem, our findings reveal that the exploration component is fundamentally different. In particular, we develop a meta-algorithm that integrates learning into a large class of load-balancing algorithms. The proposed algorithm uses almost no exploration, yet it achieves a constant regret in the accumulated latency of jobs. A similar free-exploration phenomenon has been observed in a prior paper, but in a different setting with a single server. This is joint work with Yifei Huang and Zhiyuan Tang, both from Tsinghua University.
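As a generic illustration of what learning-integrated routing can look like (a toy sketch with hypothetical parameters, not the meta-algorithm from the talk), the simulation below estimates each server's service rate from observed completions per unit of busy time and routes each arrival via a plug-in join-the-shortest-expected-workload rule.

```python
import random

random.seed(0)

# Hypothetical setup: three servers with unknown exponential service rates.
TRUE_RATES = [1.0, 2.0, 4.0]
queues = [0, 0, 0]              # jobs currently at each server
completions = [1, 1, 1]         # completed-job counts (start at 1 to avoid /0)
busy_time = [1.0, 1.0, 1.0]     # accumulated busy time per server
DT = 0.01                       # discrete-time step approximating continuous time
ARRIVAL_RATE = 5.0

for step in range(200000):
    # Plug-in rate estimates: completions per unit of busy time.
    est = [completions[i] / busy_time[i] for i in range(3)]
    # Arrival (Bernoulli approximation of a Poisson process).
    if random.random() < ARRIVAL_RATE * DT:
        i = min(range(3), key=lambda i: (queues[i] + 1) / est[i])
        queues[i] += 1
    # Service completions.
    for i in range(3):
        if queues[i] > 0:
            busy_time[i] += DT
            if random.random() < TRUE_RATES[i] * DT:
                queues[i] -= 1
                completions[i] += 1

print("estimated rates:", [round(r, 2) for r in est])
print("queue lengths:  ", queues)
```

Note that this routing rule never explores explicitly: servers are sampled only because the rule sends them work, which is one way the "free exploration" effect can arise.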
In this work, we study policy-space methods for solving the reinforcement learning (RL) problem. We first consider the setting where the model is known (the MDP setting) and show that the Natural Policy Gradient (NPG) algorithm (and other variants) converges linearly (geometrically) without the need for any regularization. In contrast to the optimization approaches used in the literature, our approach is based on approximate dynamic programming. We then consider a linear function approximation variant of it and establish its convergence guarantees. Finally, we consider the RL setting where a critic is deployed for policy evaluation when the model is unknown. We consider a critic that is based on TD learning and uses off-policy sampling with linear function approximation. This leads to the infamous deadly-triad issue. We propose a generic algorithmic framework of single time-scale multi-step TD learning with generalized importance sampling ratios, which enables us to overcome the high-variance issue in off-policy learning, and we establish its finite-sample guarantees. We show that this leads to an overall sample complexity of O(epsilon^-2).
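For reference, under the softmax tabular parameterization the NPG update takes the familiar multiplicative-weights form (a standard expression; $\eta$ is the step size and $Q^{\pi^{(k)}}$ the action-value function of the current policy):

```latex
\[
\pi^{(k+1)}(a \mid s)
  \;=\;
  \frac{\pi^{(k)}(a \mid s)\, \exp\!\big(\eta\, Q^{\pi^{(k)}}(s, a)\big)}
       {\sum_{a'} \pi^{(k)}(a' \mid s)\, \exp\!\big(\eta\, Q^{\pi^{(k)}}(s, a')\big)}
\]
```

As the step size grows, this update concentrates on the greedy action and approaches a policy iteration step, which helps connect it to the dynamic-programming viewpoint taken above.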
What makes online decision-making different from other decision-making and optimization problems? While it seems clear that the unique features are the sequential nature of taking actions and uncertainty in future outcomes, most techniques for solving such problems tend to obfuscate these features, so are these really the best ways to think about such settings? I will present the compensated coupling: a simple paradigm for reasoning about and designing online decision-making policies, based on a sample-pathwise accounting of their performance compared to some benchmark policy. This approach generalizes many standard results used in studying Markov decision processes and reinforcement learning, but it also gives us new policies that are much simpler and more effective than existing heuristics. For a large class of widely studied control problems, including online resource allocation, dynamic pricing, generalized assignment, online bin packing, and bandits with knapsacks, I will illustrate how these new algorithms achieve constant regret (i.e., additive loss compared to the hindsight optimum that is independent of the horizon and state space) under a wide range of conditions. Time permitting, I will try to describe how we can use this technique to incorporate side information and historical data in these settings, and achieve constant regret with as little as a single data trace.
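To illustrate the flavor of the resulting policies (a toy sketch of a re-solving, certainty-equivalent rule in the spirit of this line of work; the problem instance and all parameters are hypothetical), consider a multi-secretary problem in which each arriving request is accepted exactly when the fluid problem with expected future arrivals would accept its type:

```python
import random

random.seed(0)

TYPES = [1.0, 2.0, 3.0]         # rewards of the three request types
PROBS = [0.5, 0.3, 0.2]         # arrival probabilities of each type
T = 1000                        # horizon length
B = 300                         # number of requests we may accept

budget, reward = B, 0.0
for t in range(T):
    r = random.choices(TYPES, PROBS)[0]
    remaining = T - t - 1
    # Expected number of strictly more valuable future requests.
    better = sum(p * remaining for v, p in zip(TYPES, PROBS) if v > r)
    # Accept iff the fluid (certainty-equivalent) problem would accept this
    # type: the remaining budget is not exhausted by better future requests.
    if budget > 0 and better < budget:
        budget -= 1
        reward += r

print("collected reward:", reward, "| unused budget:", budget)
```

The point of the sketch is only the decision rule: at each step the policy re-solves a deterministic surrogate with expected arrivals and follows its recommendation for the current request, which is the kind of simple policy the compensated-coupling analysis is designed to evaluate against the hindsight optimum.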