We initiate the study of episodic RL under adversarial corruptions in both the rewards and the transition probabilities of the underlying system. Our solution adapts to the unknown level of corruption, degrading gracefully with the total corruption encountered. In particular, we attain near-optimal regret when the level of corruption is constant. We derive results for "tabular" MDPs as well as MDPs that admit a linear representation. Notably, we provide the first sublinear regret guarantee that goes beyond i.i.d. transitions in the bandit-feedback model for episodic RL. We build on a new framework which combines the paradigms of "optimism under uncertainty" and "successive elimination". Neither paradigm alone suffices: "optimism under uncertainty", common in current work on stochastic RL, cannot handle corruptions, while "successive elimination" works for bandits with corruptions but is provably inefficient even for stochastic RL. Joint work with Thodoris Lykouris, Max Simchowitz, and Wen Sun. https://arxiv.org/abs/1911.08689
I will discuss new provably efficient algorithms for reinforcement learning in rich observation environments with arbitrarily large state spaces. Both algorithms operate by learning succinct representations of the environment, which they use in an exploration module to acquire new information. The first algorithm, called Homer, operates in a block MDP model and uses a contrastive learning objective to learn the representation. The second algorithm, called FLAMBE, operates in the much richer class of low-rank MDPs and is model-based. Both algorithms accommodate nonlinear function approximation and enjoy provable sample and computational efficiency guarantees.
The stochastic multi-armed bandit model captures the tradeoff between exploration and exploitation. We study the effects of competition and cooperation on this tradeoff. Suppose there are k arms and two players, Alice and Bob. In every round, each player pulls an arm, receives the resulting reward, and observes the choice of the other player but not their reward. Alice's utility is ΓA+λΓB (and similarly for Bob), where ΓA is Alice's total reward and λ∈[−1,1] is a cooperation parameter. At λ=−1, the players are competing in a zero-sum game; at λ=1, they are fully cooperating; and at λ=0, they are neutral: each player's utility is their own reward. The model is related to the economics literature on strategic experimentation, where players usually observe each other's rewards.
With discount factor β, the Gittins index reduces the one-player problem to the comparison between a risky arm, with a prior μ, and a predictable arm, with success probability p. The value of p where the player is indifferent between the arms is the Gittins index g=g(μ,β)>m, where m is the mean of the risky arm.
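For concreteness, a standard characterization of the Gittins index, consistent with the indifference definition above (added here for reference; the talk may use a different but equivalent formulation), is

    g(\mu,\beta) \;=\; \sup_{\tau \ge 1} \frac{\mathbb{E}\big[\sum_{t=0}^{\tau-1} \beta^t X_t\big]}{\mathbb{E}\big[\sum_{t=0}^{\tau-1} \beta^t\big]},

where the X_t are the successive rewards of the risky arm and the supremum is over stopping times τ. Committing forever to the predictable arm is worth p/(1−β), so the player is indifferent exactly at p=g(μ,β); the option to abandon the risky arm after bad outcomes is what makes g>m.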
We show that competing players explore less than a single player: there is p∗∈(m,g) so that for all p>p∗, the players stay at the predictable arm. However, the players are not myopic: they still explore for some p>m. On the other hand, cooperating players explore more than a single player. We also show that neutral players learn from each other, receiving strictly higher total rewards than they would playing alone, for all p∈(p∗,g), where p∗ is the threshold from the competing case.
Finally, we show that competing and neutral players eventually settle on the same arm in every Nash equilibrium, while this can fail for cooperating players. This is based on joint work with Yuval Peres.
The multi-armed bandit problem is a well-established model in online decision making: a single player makes sequential decisions in a non-stationary environment to maximize their cumulative reward.
The problem becomes significantly more challenging when there are multiple players in the same environment and each arm offers only one piece of reward per round. In this setting, if two players pick the same arm in the same round, they can only obtain one piece of reward between them instead of two. To maximize reward, players need to collaborate to avoid "collisions" -- i.e., they need to make sure that they do not all rush to the same arm (even if it has the highest reward).
We consider the even more challenging setting where communication between players is completely disabled: e.g., they are separated in different parts of the world without any "Zoom". We show that nearly optimal regret can still be obtained in this setting: players can in fact collaborate in a non-stationary environment without any communication (and, of course, without quantum entanglement either :))
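To make the collision model above concrete, here is a minimal simulation sketch in Python; the stationary Bernoulli rewards and the random tie-breaking rule on collision are illustrative assumptions, not details taken from the talk.

    import numpy as np

    rng = np.random.default_rng(0)
    K, T = 5, 1000
    means = rng.uniform(size=K)          # unknown arm means (stationary here for simplicity)

    def play_round(arm_a, arm_b):
        # Each arm produces one piece of reward per round; if both players pull the
        # same arm, only one of them (chosen at random here) receives it.
        r = rng.binomial(1, means)
        if arm_a == arm_b:
            return (r[arm_a], 0.0) if rng.random() < 0.5 else (0.0, r[arm_b])
        return r[arm_a], r[arm_b]

    # A trivial collision-free baseline without communication: the players commit in
    # advance to distinct arm indices. The algorithms in the talk must instead learn
    # which arms are best while still avoiding collisions, with no communication.
    total = np.zeros(2)
    for t in range(T):
        total += play_round(0, 1)
    print(total)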
In collaboration with the Greek government, we use machine learning to manage the threat of COVID-19. With tens of thousands of international visitors every day, Greece cannot test each visitor to ensure that they are not a carrier of COVID-19. We developed a bandit policy that balances allocating scarce tests to (i) continuously monitor the dynamic infection risk of passengers from different locations (exploration), and (ii) preferentially target risky tourist profiles for testing (exploitation). Our solution is currently deployed across all ports of entry to Greece. I will describe a number of technical challenges, including severely imbalanced outcomes, batched/delayed feedback, high-dimensional arms, port-specific testing constraints, and transferring knowledge from (unreliable) public epidemiological data. Joint work with Kimon Drakopoulos and Vishal Gupta.
Motivated by online decision-making in time-varying combinatorial environments, we study the problem of transforming offline algorithms into their online counterparts. We focus on offline combinatorial problems that are amenable to a constant-factor approximation using a greedy algorithm that is robust to local errors. For such problems, we provide a general framework that efficiently transforms offline robust greedy algorithms into online ones using Blackwell approachability. We show that the resulting online algorithms have O(T^{1/2}) (approximate) regret in the full-information setting. We further introduce a bandit extension of Blackwell approachability that we call Bandit Blackwell approachability. We leverage this notion to transform robust greedy offline algorithms into online algorithms with O(T^{2/3}) (approximate) regret in the bandit setting. Demonstrating the flexibility of our framework, we apply our offline-to-online transformation to several problems at the intersection of revenue management, market design, and online optimization, including product ranking optimization in online platforms, reserve price optimization in auctions, and submodular maximization. We show that our transformation, when applied to these applications, leads to new regret bounds or improves the currently known bounds.
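As one concrete example of the offline primitives mentioned above, here is a minimal Python sketch of the classical greedy algorithm for monotone submodular maximization under a cardinality constraint, instantiated for weighted coverage. The specific robustness notion and the Blackwell-approachability transformation themselves are in the paper and are not reproduced here.

    import numpy as np

    def greedy_coverage(sets, weights, k):
        # Classical greedy: repeatedly add the set with the largest marginal coverage gain.
        # For monotone submodular objectives this achieves a (1 - 1/e) approximation.
        covered, chosen = set(), []
        for _ in range(k):
            gains = [sum(weights[e] for e in s - covered) for s in sets]
            best = int(np.argmax(gains))
            chosen.append(best)
            covered |= sets[best]
        return chosen, sum(weights[e] for e in covered)

    sets = [{0, 1, 2}, {2, 3}, {3, 4, 5}, {0, 5}]
    weights = {e: 1.0 for e in range(6)}
    print(greedy_coverage(sets, weights, k=2))   # picks {0,1,2} then {3,4,5}, covering weight 6.0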
The sample and computational complexity of reinforcement learning algorithms under the generative model has been a central focus of the literature. Most existing approaches that can guarantee epsilon-optimality require a complexity that either depends on the size of the state space or scales exponentially in the time horizon T, in both cases suffering from the curse of dimensionality. Other approaches may be more efficient, but typically either make strong structural assumptions or do not come with optimality guarantees.

We reconcile this dilemma for the celebrated problem of high-dimensional optimal stopping, a special case of RL with important applications to financial engineering, operations research, economics, and computer science, and one that is well known to be computationally challenging when the state space and time horizon are large. We propose the first algorithm with sample and computational complexity polynomial in T, and effectively independent of the underlying state space, which outputs epsilon-optimal stopping policies and value function approximations. Our algorithm is inspired by a novel connection between network flow theory in combinatorial optimization and the classical martingale duality theory for stopping problems, and can be viewed as a higher-order generalization of the celebrated prophet inequalities.

Our approach yields an expansion of the value functions as an infinite sum, fundamentally different from standard expansions based on backwards induction, with the following properties: 1. each term is the expectation of an elegant, recursively defined infimum; 2. the first k terms require only T^k samples to approximate; and 3. truncating at the first k terms yields a (normalized) error of 1/k. The terms of the associated sum can also be viewed as a provably good set of basis functions, which can be efficiently evaluated by simulation, in contrast with past approximate dynamic programming approaches, which take a heuristic approach to deriving basis functions. Our work shows that although the state space is exponentially large, one need only explore it to the extent that one can compute these (few) basis functions, which can be done efficiently.

Time permitting, we will also explore applications of our approach to other high-dimensional RL problems (including those arising in dynamic pricing and online combinatorial optimization), and discuss connections to parallel and quantum computing. Joint work with Ph.D. student Yilun Chen.
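For readers unfamiliar with the prophet inequalities mentioned above, the following minimal Python sketch illustrates the classical single-threshold rule (a textbook result, not the authors' algorithm): for independent nonnegative rewards observed in sequence, stopping at the first value exceeding half of E[max] earns at least half of E[max] in expectation. The exponential reward distributions are an arbitrary illustrative choice.

    import numpy as np

    rng = np.random.default_rng(0)
    n, n_mc = 10, 20000

    def sample(size):
        # n independent nonnegative rewards per trajectory (illustrative distributions)
        return rng.exponential(scale=np.arange(1, n + 1), size=size)

    tau = 0.5 * sample((n_mc, n)).max(axis=1).mean()   # threshold = E[max]/2 (Monte Carlo estimate)

    def threshold_policy(x, tau):
        for v in x:
            if v >= tau:
                return v        # stop at the first value clearing the threshold
        return x[-1]            # otherwise take the last value

    xs = sample((n_mc, n))
    print("prophet benchmark E[max]:", xs.max(axis=1).mean())
    print("single-threshold policy :", np.mean([threshold_policy(x, tau) for x in xs]))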
Over the last two decades we have developed a good understanding of how to quantify the impact of strategic user behavior on overall performance in many games (including traffic routing as well as online auctions), and we have shown that the resulting bounds extend to repeated games, assuming players use a form of no-regret learning that helps them adapt to the environment. In this talk we will study this phenomenon in the context of a game modeling queuing systems: routers compete for servers, and packets that do not get service are resent in future rounds, so the number of packets at each round depends on the routers' success in previous rounds. In joint work with Jason Gaitonde, we analyze the resulting highly dependent random process and find that if server capacity is high enough to allow a centralized, knowledgeable scheduler to get all packets served even at double the packet arrival rate, then learning helps the queues coordinate their behavior: assuming older packets have priority, the expected number of packets in the queues remains bounded over time.
The hardware of quantum computers is limited in qubit count, fidelity, and connectivity, now and for the near future. In order to gain maximum advantage from these machines, it is imperative to use these resources in an optimal and efficient way. I will present the example of the SPARQS co-design geared at simulating small clusters of the Fermi-Hubbard model at finite nonzero temperature [1], which spares the experimenter the need for intersecting wires and three-dimensional chip integration. I will also show a new example of a co-design of the QAOA algorithm that makes it possible to avoid programmable interactions [2]. I will finally muse about the fluidity of the line between modern closed-loop gate design and modern variational algorithms, suggesting time-continuous strategies for state preparation [3].
[1] Pierre-Luc Dallaire-Demers and Frank K. Wilhelm, Phys. Rev. A 94, 062304, 2016
[2] David Headley, Thorge Müller, Ana Martin, Enrique Solano, Mikel Sanz, Frank K. Wilhelm, arXiv:2002.12215
[3] Shai Machnes, Nicolas Wittler, Federico Roy, Kevin Pack, Anurag Sasha-Roy, and Frank K. Wilhelm, Methodology for Control, Calibration and Characterization of quantum devices, applied to superconducting qubits, in preparation
Since the first observation of coherent quantum behaviour in a superconducting qubit 20 years ago, there have been significant developments in the field of superconducting quantum circuits. With coherence times improved by over 5 orders of magnitude, it is now possible to execute increasingly complex quantum algorithms with these circuits. Despite these successes, the coherence time of superconducting devices must still be increased for quantum computation to become a reality. One approach is to improve existing devices. Another is to design new superconducting qubits with intrinsic protection against certain types of errors. In this talk, I will discuss how quantum information can be robustly encoded in cat states of the electromagnetic field stored in superconducting quantum devices. A feature of this encoding is that it exhibits biased noise. I will present how to realize bias-preserving gates on this qubit, and how these ideas can be further improved with quantum error correction.
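As a small illustration of the cat states mentioned above (a sketch assuming QuTiP is available; the bias-preserving gates and error-correction schemes from the talk are not reproduced), the even and odd superpositions of two coherent states, which in one common convention span the code space of a cat qubit, can be constructed as follows.

    from qutip import coherent

    N, alpha = 30, 2.0                                              # Fock-space cutoff and cat amplitude
    cat_plus  = (coherent(N, alpha) + coherent(N, -alpha)).unit()   # even cat state
    cat_minus = (coherent(N, alpha) - coherent(N, -alpha)).unit()   # odd cat state
    print(abs(cat_plus.overlap(cat_minus)))                         # ~0: opposite photon-number parity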
Local Hamiltonians with topological quantum order exhibit highly entangled ground states that cannot be prepared by shallow quantum circuits. Here, we show that this property may extend to all low-energy states in the presence of an on-site Z2 symmetry. This proves a version of the No Low-Energy Trivial States (NLTS) conjecture for a family of local Hamiltonians with symmetry protected topological order. A surprising consequence of this result is that the Goemans-Williamson algorithm outperforms the Quantum Approximate Optimization Algorithm (QAOA) for certain instances of MaxCut, at any constant level. We argue that the locality and symmetry of QAOA severely limits its performance. To overcome these limitations, we propose a non-local version of QAOA, and give numerical evidence that it significantly outperforms standard QAOA for frustrated Ising models on random 3-regular graphs.
This is joint work with Sergey Bravyi, Alexander Kliesch and Eugene Tang.
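For reference, below is a minimal Python sketch of the Goemans-Williamson algorithm discussed above (the standard SDP relaxation plus random-hyperplane rounding for MaxCut), assuming cvxpy with an SDP-capable solver is installed; this is the textbook baseline, not the non-local QAOA variant proposed in the talk.

    import numpy as np
    import cvxpy as cp

    def goemans_williamson_maxcut(W, rounds=100, seed=0):
        n = W.shape[0]
        # SDP relaxation: maximize 1/4 * sum_ij W_ij (1 - X_ij) over X >= 0 with unit diagonal.
        X = cp.Variable((n, n), symmetric=True)
        prob = cp.Problem(cp.Maximize(0.25 * cp.sum(cp.multiply(W, 1 - X))),
                          [X >> 0, cp.diag(X) == 1])
        prob.solve()
        # Factor X ~ V V^T (clip tiny negative eigenvalues from the numerical solve).
        vals, vecs = np.linalg.eigh(X.value)
        V = vecs * np.sqrt(np.clip(vals, 0, None))
        rng = np.random.default_rng(seed)
        best = 0.0
        for _ in range(rounds):
            # Random-hyperplane rounding: sign pattern of projections onto a random direction.
            signs = np.sign(V @ rng.standard_normal(n))
            best = max(best, 0.25 * np.sum(W * (1 - np.outer(signs, signs))))
        return best

    # Example: a 5-cycle, whose maximum cut has value 4.
    W = np.zeros((5, 5))
    for i in range(5):
        W[i, (i + 1) % 5] = W[(i + 1) % 5, i] = 1.0
    print(goemans_williamson_maxcut(W))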