In this work, we study structured multi-armed bandits, the problem of online decision-making under uncertainty in the presence of structural information. Here, the decision-maker needs to discover the best course of action despite observing only uncertain rewards over time. The decision-maker knows certain structural information about the reward distributions and would like to exploit it to minimize regret, defined as the performance gap relative to a benchmark policy that knows the best action ahead of time. In the absence of structural information, the classical upper confidence bound (UCB) and Thompson sampling algorithms are well known to suffer only minimal regret. As recently pointed out, however, neither algorithm can exploit structural information that is commonly available in practice. We propose a novel learning algorithm, called DUSA, whose worst-case regret matches the information-theoretic regret lower bound up to a constant factor and that can handle a wide range of structural information. DUSA solves a dual counterpart of the regret lower bound at the empirical reward distribution and follows its suggested play. It is the first computationally viable learning policy for structured bandit problems with asymptotically minimal regret.
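As a point of reference for the classical baseline mentioned above, the UCB rule can be sketched in a few lines (a minimal, illustrative UCB1 implementation on Bernoulli arms; all names are ours, not the paper's, and this is the unstructured baseline, not DUSA):

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Play UCB1 on Bernoulli arms; return total reward and pull counts."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # cumulative reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # pull each arm once to initialize
        else:
            # optimism: empirical mean plus a confidence radius
            arm = max(range(k), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total, counts
```

Index policies of this kind ignore any structure linking the arms; the abstract's point is that DUSA departs from this template by repeatedly solving a dual of the information-theoretic lower bound at the empirical reward distribution.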

Motivated by a variety of online matching markets, we consider demand and supply units that arrive i.i.d. in [0,1]^d and need to be matched with each other while minimizing the expected average distance between matched pairs (the "cost"). We characterize the achievable cost scaling in three models as a function of the dimension d and the amount of excess supply (M or m): (i) static matching of N demand units with N+M supply units; (ii) a semi-dynamic model where N+M supply units are present beforehand and N demand units arrive sequentially and must be matched immediately; (iii) a fully dynamic model where there are always m supply units present in the system, one supply and one demand unit arrive in each period, and the demand unit must be matched immediately. We show that, despite the matching constraint and uncertainty about the future, a cost nearly as small as the distance to the nearest neighbor is achievable in all cases *except* models (i) and (ii) with d=1 and M = o(N). Moreover, the latter is the only case in models (i) and (ii) where excess supply significantly reduces the achievable cost.
In papers with Omar Besbes, Akshit Kumar, and Yilun Chen, we introduce application-inspired variants of model (ii) in which the distributions of supply and demand can differ, and we derive new algorithmic and scaling insights.
Links to papers: Paper 1, Paper 2 and Paper 3.
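The nearest-neighbor benchmark in the results above is easy to probe numerically (a hedged Monte Carlo sketch of ours, not code from the papers): with N supply points drawn uniformly in [0,1]^d, a demand point's expected distance to its nearest supply point scales like N^(-1/d).

```python
import random

def mean_nn_distance(n_supply, n_demand, d, seed=0):
    """Average Euclidean distance from each demand point to its nearest
    supply point, all points drawn i.i.d. uniform on [0,1]^d."""
    rng = random.Random(seed)
    supply = [[rng.random() for _ in range(d)] for _ in range(n_supply)]
    demand = [[rng.random() for _ in range(d)] for _ in range(n_demand)]

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    total = 0.0
    for q in demand:
        total += min(dist2(p, q) for p in supply) ** 0.5
    return total / n_demand
```

The nearest-neighbor distance is a natural lower bound on the per-pair cost of any matching; the talk's results say matched cost can nearly attain it except in the d=1, M = o(N) cases of models (i) and (ii).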

Online allocation problems with resource constraints are central problems in revenue management and online advertising. In these problems, requests arrive sequentially during a finite horizon and, for each request, a decision maker needs to choose an action that consumes a certain amount of resources and generates a reward. The objective is to maximize cumulative rewards subject to a constraint on the total consumption of resources. In this talk, we consider a data-driven setting in which the reward and resource consumption of each request are generated using an input model that is unknown to the decision maker. We design a general class of algorithms that attain good performance in various input models without knowing which type of input they are facing. In particular, our algorithms are asymptotically optimal under independent and identically distributed inputs as well as various non-stationary stochastic input models, and they attain an asymptotically optimal fixed competitive ratio when the input is adversarial. Our algorithms operate in the Lagrangian dual space: they maintain a dual multiplier for each resource that is updated using online algorithms such as online mirror descent and PI controllers. Our proposed algorithms are simple, fast, robust to estimation errors, and, in contrast to existing methods for online allocation problems, do not require convexity of the revenue function, the consumption functions, or the action space. A major motivation for this line of research is its application to budget pacing in online advertising.
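The dual scheme described above can be sketched with projected online gradient descent (a special case of online mirror descent) as the dual update. This is an illustrative toy of ours, with a single resource, discrete actions, and invented variable names, not the paper's algorithm:

```python
def dual_online_allocation(requests, budget, horizon, eta=0.1):
    """Greedy primal action against a dual price, plus an online
    gradient-descent update of the dual multiplier.

    requests: list of (rewards, costs) pairs, one per period, where
              rewards[a] / costs[a] are the reward / resource use of action a.
    """
    rho = budget / horizon      # per-period resource endowment
    mu = 0.0                    # dual multiplier (price of the resource)
    remaining, total_reward = budget, 0.0
    for rewards, costs in requests:
        # primal step: best action at the current dual price
        a = max(range(len(rewards)), key=lambda i: rewards[i] - mu * costs[i])
        if rewards[a] - mu * costs[a] > 0 and costs[a] <= remaining:
            remaining -= costs[a]
            total_reward += rewards[a]
            used = costs[a]
        else:
            used = 0.0          # take the void action
        # dual step: online gradient descent, projected to mu >= 0
        mu = max(0.0, mu + eta * (used - rho))
    return total_reward
```

Note that the primal step needs no convexity: it only maximizes reward minus priced consumption over a finite action set, which matches the abstract's claim that convexity of the action space is not required.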

We introduce the online stochastic Convex Programming (CP) problem, a very general version of stochastic online problems that allows arbitrary concave objectives and convex feasibility constraints. Many well-studied problems, such as the Adwords problem, online stochastic packing and covering, and online stochastic matching with concave returns, are special cases of online stochastic CP. We present fast algorithms for these problems that achieve near-optimal regret guarantees for both the i.i.d. and the random-permutation models of stochastic inputs. When applied to the special case of online packing, our ideas yield a simpler and faster primal-dual algorithm for this well-studied problem that achieves the optimal competitive ratio. Our techniques make explicit how the primal-dual paradigm and online learning connect to online stochastic CP.
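Schematically (in our notation, which may differ from the paper's), an online stochastic CP instance asks the decision maker to choose a decision x_t upon seeing the t-th input so as to

```latex
\max \; f\!\left(\tfrac{1}{T}\sum_{t=1}^{T} v_t(x_t)\right)
\quad \text{subject to} \quad
\tfrac{1}{T}\sum_{t=1}^{T} u_t(x_t) \in S,
```

with f concave and S convex, where v_t and u_t are the objective and constraint vectors revealed with the t-th input. Adwords and online packing/covering correspond to particular choices of f, u_t, and S.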

Decision-makers often face the "many bandits" problem, where one must simultaneously learn across related but heterogeneous contextual bandit instances. For instance, a large retailer may wish to dynamically learn product demand across many stores to solve pricing or inventory problems, making it desirable to learn jointly for stores serving similar customers; alternatively, a hospital network may wish to dynamically learn patient risk across many providers to allocate personalized interventions, making it desirable to learn jointly for hospitals serving similar patient populations. Motivated by real datasets, we decompose the unknown parameter in each bandit instance into a global parameter plus a sparse instance-specific term. Then, we propose a novel two-stage estimator that exploits this structure in a sample-efficient way by combining robust statistics (to learn across similar instances) with LASSO regression (to debias the results). We embed this estimator within a bandit algorithm, and we prove that it improves the dependence of asymptotic regret bounds on the context dimension; the improvement is exponential for data-poor instances. We further demonstrate how our results depend on the underlying network structure of bandit instances. Finally, we illustrate the value of our approach on synthetic and real datasets. Joint work with Kan Xu. Paper: https://arxiv.org/abs/2112.14233
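The two-stage idea can be sketched on synthetic linear data (an illustrative rendition of ours, not the paper's code: we use a componentwise median as the robust cross-instance estimator and a simple ISTA solver for the LASSO step; the function names are invented):

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(X, y, lam, iters=500):
    """Minimize (0.5/n)*||y - X w||^2 + lam*||w||_1 by ISTA."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of the gradient
    w = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - grad / L, lam / L)
    return w

def two_stage_estimate(Xs, ys, lam=0.1):
    """Stage 1: robust global estimate (median of per-instance OLS fits).
       Stage 2: sparse instance-specific corrections via LASSO on residuals."""
    ols = [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in zip(Xs, ys)]
    theta_global = np.median(np.stack(ols), axis=0)
    deltas = [lasso_ista(X, y - X @ theta_global, lam) for X, y in zip(Xs, ys)]
    return theta_global, deltas
```

The median step tolerates a minority of instances that deviate from the shared parameter, and the LASSO step recovers each instance's sparse deviation, mirroring the global-plus-sparse decomposition in the abstract.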

Oftentimes, decisions involve multiple, possibly conflicting, rewards or costs. For example, solving a problem faster may incur extra cost or sacrifice safety. In such cases, one possibility is to aim for decisions that maximize the value obtained from one of the reward functions while keeping the values obtained from the other reward functions above some prespecified targets. Up to logarithmic factors, we resolve the optimal worst-case sample complexity of finding solutions to such problems in the discounted MDP setting when a generative model of the MDP is available. While this is clearly an oversimplified problem, our analysis reveals an interesting gap between the sample complexity of this problem and that of solving MDPs when the solver must return a solution that, with a prescribed probability, does not violate the constraints. In the talk, I will explain the background of the problem, the origin of the gap, and the algorithm we know that achieves the near-optimal sample complexity, closing with some open questions. This is joint work with Sharan Vaswani and Lin F. Yang.
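In symbols (our notation, not necessarily the speaker's), the constrained problem described above is

```latex
\max_{\pi} \; V^{\pi}_{r_0}
\quad \text{subject to} \quad
V^{\pi}_{r_i} \ge \tau_i, \qquad i = 1, \dots, K,
```

where V^pi_{r_i} denotes the expected discounted return of policy pi under reward function r_i, and the tau_i are the prespecified target values.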

Offline reinforcement learning (RL) is a paradigm for designing agents that can learn from existing datasets. Because offline RL can learn policies without collecting new data or expensive expert demonstrations, it offers great potential for solving real-world problems. However, offline RL faces a fundamental challenge: oftentimes, real-world data can only be collected by policies meeting certain criteria (e.g., on performance, safety, or ethics). As a result, existing data, though large, may lack diversity and have limited usefulness. In this talk, I will introduce a generic game-theoretic approach to offline RL. It frames offline RL as a two-player game in which a learning agent competes with an adversary that simulates the uncertain decision outcomes due to missing data coverage. Using this game analogy, I will present a systematic and provably correct framework for designing offline RL algorithms that can learn good policies with state-of-the-art empirical performance. In addition, I will show that this framework reveals a natural connection between offline RL and imitation learning, which ensures that the learned policies are never worse than the data-collection policies, regardless of hyperparameter choices.
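Schematically (in our notation), the game-theoretic framing solves a max-min problem: the learner picks a policy, and the adversary picks any model consistent with the dataset D:

```latex
\hat{\pi} \in \arg\max_{\pi} \; \min_{M \in \mathcal{M}(D)} \; J^{M}(\pi),
```

where M(D) denotes the set of models consistent with the data and J^M(pi) the return of policy pi under model M. The adversary's freedom on state-actions the data does not cover is exactly the "missing data coverage" the abstract refers to.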

A fundamental challenge in interactive learning and decision making, ranging from bandit problems to reinforcement learning, is to provide sample-efficient, adaptive learning algorithms that achieve near-optimal regret. This question is analogous to the classical problem of optimal (supervised) statistical learning, where there are well-known complexity measures (e.g., VC dimension and Rademacher complexity) that govern the statistical complexity of learning. However, characterizing the statistical complexity of interactive learning is substantially more challenging due to the adaptive nature of the problem. In this talk, we will introduce a new complexity measure, the Decision-Estimation Coefficient, which is necessary and sufficient for sample-efficient interactive learning. In particular, we will provide: (1) a lower bound on the optimal regret for any interactive decision-making problem, establishing the Decision-Estimation Coefficient as a fundamental limit; and (2) a unified algorithm design principle, Estimation-to-Decisions, which attains a regret bound matching our lower bound, thereby achieving optimal sample-efficient learning as characterized by the Decision-Estimation Coefficient. Taken together, these results give a theory of learnability for interactive decision making. When applied to reinforcement learning settings, the Decision-Estimation Coefficient recovers essentially all existing hardness results and lower bounds.
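One common formulation of the Decision-Estimation Coefficient (written in our notation; details may differ from the talk) is

```latex
\mathrm{dec}_{\gamma}(\mathcal{M}, \bar{M})
  = \inf_{p \in \Delta(\Pi)} \sup_{M \in \mathcal{M}}
    \mathbb{E}_{\pi \sim p}\!\left[
      f^{M}(\pi_{M}) - f^{M}(\pi)
      - \gamma \, D^{2}_{\mathrm{H}}\!\big(M(\pi), \bar{M}(\pi)\big)
    \right],
```

where f^M(pi) is the mean reward of decision pi under model M, pi_M is the optimal decision for M, and D^2_H is the squared Hellinger distance. Intuitively, it trades off the regret the learner suffers against the information (estimation error) the adversary must reveal.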

Reinforcement learning (RL) is a learning paradigm for large-scale sequential decision-making problems in complex stochastic systems. Many modern RL algorithms solve the underlying Bellman fixed-point equation using stochastic approximation (SA). This two-part tutorial presents an overview of our results on SA and illustrates how they can be used to obtain sample complexity results for a large class of RL algorithms.
Part I of the tutorial focuses on SA, a popular approach for solving fixed-point equations when the information is corrupted by noise. We consider a class of SA algorithms for operators that are contractive under arbitrary norms (especially the l-infinity norm). We present finite-sample bounds on the mean-square error, which are established using a Lyapunov framework based on infimal convolution and the generalized Moreau envelope. We then present our recent result on exponential concentration of the tail error, even when the iterates are not bounded by a constant. These tail bounds are obtained using exponential supermartingales in conjunction with the Moreau envelope and bootstrapping.
Part II of the tutorial focuses on RL. We briefly illustrate the connection between RL algorithms and SA of contractive operators and highlight the importance of the infinity norm. We then exploit the results from Part I to present finite-sample bounds for various RL algorithms, including on-policy and off-policy algorithms, in both tabular and linear function approximation settings.
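The basic SA iteration discussed in Part I can be sketched as follows (a toy of ours: a fixed-point iteration for an operator that is a contraction in the sup norm, driven by additive zero-mean noise and Robbins-Monro step sizes):

```python
import random

def stochastic_approximation(T, x0, steps, noise=0.1, seed=0):
    """Iterate x_{k+1} = x_k + a_k * (T(x_k) - x_k + w_k), a_k = 1/(k+1),
    where w_k is zero-mean Gaussian noise on each coordinate."""
    rng = random.Random(seed)
    x = list(x0)
    for k in range(steps):
        a = 1.0 / (k + 1)
        tx = T(x)
        x = [xi + a * (ti - xi + rng.gauss(0.0, noise))
             for xi, ti in zip(x, tx)]
    return x

# Example operator: a sup-norm contraction with fixed point (2, -1),
# T(x) = 0.5 * (x - x_star) + x_star (contraction factor 0.5).
def T(x):
    star = [2.0, -1.0]
    return [0.5 * (xi - si) + si for xi, si in zip(x, star)]
```

The iterates converge to the fixed point despite the noise; the tutorial's results quantify exactly how fast, in mean square and with exponential tail bounds.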

In today's computing systems, there is a strong tension between achieving high server utilization and accommodating the time-varying resource requirements of jobs. Motivated by this problem, we study a stochastic bin-packing formulation with time-varying item sizes, where bins and items correspond to servers and jobs, respectively. Our goal is to answer the following fundamental question: how can we minimize the number of active servers (servers running at least one job) given a budget for the cost associated with resource overcommitment on servers? We propose a novel framework for designing job-dispatching policies, which reduces the problem to a policy design problem in a single-server system through policy conversions. Through this framework, we develop a policy that is asymptotically optimal as the job arrival rate increases. This is joint work with Yige Hong at Carnegie Mellon University and Qiaomin Xie at the University of Wisconsin–Madison.

Many problems in online decision-making can be viewed through a lens of exchangeable actions: over a long time horizon of length T, an arrival process allows for many opportunities to take a particular action, but the exact timing of actions is irrelevant for the eventual outcome. Examples of this phenomenon include bin packing (where it does not matter when we put an item of a given size into a given bin), knapsack (where it does not matter when we accept an item of a given value), and a range of matching problems (where it does not matter when we assign a request type to a given resource), among others. In this talk, we survey a number of results that capture the conditions under which such problems admit algorithms with uniform loss guarantees, i.e., where the additive loss relative to an optimal solution in hindsight is bounded independently of the horizon length. Our conditions characterize permissible geometries of the objective function, minimal requirements on the arrival process, and uncertainty about the underlying arrival processes, including the length of the time horizon. Based on joint work with Jiayu (Kamessi) Zhao, Chamsi Hssaine, Sid Banerjee, Thodoris Lykouris, Wentao Weng, Elisabeth Paulson, and Brad Sturt.