Today’s computing systems consist of servers that offer highly heterogeneous service rates due to the diverse machine types and the dynamic computing environment. In such systems, to guarantee low latency, load-balancing algorithms need to consider not only the amount of work on servers but also their service rates. However, because of the dynamic nature of the computing environment, the service rates cannot be obtained beforehand and thus have to be learned on the fly. In this talk, we consider a load-balancing system that consists of multiple servers with unknown service rates. Each incoming job needs to be assigned to one of the servers upon arrival. Our goal is to develop learning-integrated job-assignment algorithms that perform comparably to their counterparts under known service rates. Although some facets of this problem resemble the well-studied multi-armed bandit problem, our findings reveal that the exploration component is fundamentally different. In particular, we develop a meta-algorithm that integrates learning into a large class of load-balancing algorithms. The proposed algorithm uses almost no exploration, yet it achieves constant regret in the accumulated latency of jobs. A similar free-exploration phenomenon has been observed in a prior paper, but in a different setting with a single server. This is joint work with Yifei Huang and Zhiyuan Tang, both from Tsinghua University.
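As a toy illustration of the setting (not the authors' meta-algorithm), the sketch below routes each arriving job to the server with the least estimated work ahead of it, and learns the unknown service rates only from the service times of jobs it routes anyway, i.e., with essentially no dedicated exploration. The rates, routing rule, and horizon are all illustrative assumptions.

```python
import random

random.seed(42)
true_rates = [1.0, 2.0, 4.0]          # unknown to the dispatcher
n = len(true_rates)
# one exploratory sample per server just to initialize the estimates
samples = [[random.expovariate(r)] for r in true_rates]
workload = [0.0] * n                  # total service time assigned so far

for _ in range(3000):
    est = [len(s) / sum(s) for s in samples]   # empirical service rates
    # join the server with the least estimated work ahead of this job
    i = min(range(n), key=lambda j: workload[j] + 1.0 / est[j])
    t = random.expovariate(true_rates[i])      # realized service time
    samples[i].append(t)                       # free learning signal
    workload[i] += t

est = [len(s) / sum(s) for s in samples]
```

With no exploration beyond routing itself, the estimates order the servers correctly, and faster servers receive proportionally more jobs.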

In this work, we study policy-space methods for solving the reinforcement learning (RL) problem. We first consider the setting where the model is known (the MDP setting) and show that the Natural Policy Gradient (NPG) algorithm and its variants have linear (geometric) convergence without the need for any regularization. In contrast to the optimization approaches used in the literature, our approach is based on approximate dynamic programming. We then consider a linear function approximation variant of NPG and establish its convergence guarantees. Finally, we consider the RL setting, where a critic is deployed for policy evaluation when the model is unknown. We consider a critic based on TD learning that uses off-policy sampling with linear function approximation, which leads to the infamous deadly-triad issue. We propose a generic algorithmic framework of single-time-scale multi-step TD learning with generalized importance-sampling ratios that enables us to overcome the high-variance issue in off-policy learning, and we establish its finite-sample guarantees. We show that this leads to an overall sample complexity of $O(\epsilon^{-2})$.
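To make the critic concrete, here is a minimal one-step, two-state sketch of off-policy TD learning with per-step importance-sampling ratios; the talk's framework is multi-step and single-time-scale with generalized ratios, so the MDP, policies, and step sizes below are all illustrative assumptions.

```python
import random

random.seed(1)

# Two states {0, 1}; action 1 switches state, action 0 stays.
# Target policy: always switch. Behavior policy: uniform over the two actions.
# Reward is the current state's index; tabular values = one-hot linear features.
gamma = 0.9
V = [0.0, 0.0]
s = 0
for t in range(200000):
    a = random.randint(0, 1)              # behavior policy: uniform
    rho = 2.0 if a == 1 else 0.0          # pi(a|s) / b(a|s)
    s_next = 1 - s if a == 1 else s
    r = s
    alpha = 100.0 / (1000.0 + t)          # decaying step size
    V[s] += alpha * rho * (r + gamma * V[s_next] - V[s])
    s = s_next

# Target-policy Bellman equations: V(0) = 0.9 V(1), V(1) = 1 + 0.9 V(0),
# so V(0) = 0.9 / 0.19 and V(1) = 1 / 0.19.
```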

What makes online decision-making different from other decision-making/optimization problems? While it seems clear that the unique features are the sequential nature of taking actions and uncertainty in future outcomes, most techniques for solving such problems tend to obfuscate these features, so are these really the best ways to think about such settings? I will present the compensated coupling: a simple paradigm for reasoning about and designing online decision-making policies, based on a sample-pathwise accounting of their performance compared to some benchmark policy. This approach generalizes many standard results used in studying Markov decision processes and reinforcement learning, but also gives us new policies that are much simpler and more effective than existing heuristics. For a large class of widely-studied control problems, including online resource allocation, dynamic pricing, generalized assignment, online bin packing, and bandits with knapsacks, I will illustrate how these new algorithms achieve constant regret (i.e., additive loss compared to the hindsight optimum which is independent of the horizon and state space) under a wide range of conditions. Time permitting, I will try to describe how we can use this technique to incorporate side information and historical data in these settings, and achieve constant regret with as little as a single data trace.
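The constant-regret phenomenon can be seen in a toy multisecretary instance, where a simple re-solving policy protects capacity for the expected number of future high-value arrivals; this is a standard illustration of the behavior, not the compensated-coupling analysis itself, and all parameters below are hypothetical.

```python
import random

random.seed(0)

# T arrivals with values in {1, 2}, capacity B; p_high = P(value = 2).
T, B, p_high = 1000, 500, 0.4
vals = [2 if random.random() < p_high else 1 for _ in range(T)]

cap, reward = B, 0
for t, v in enumerate(vals):
    remaining = T - t - 1
    if v == 2 and cap > 0:
        cap -= 1; reward += 2
    elif v == 1 and cap > p_high * remaining:
        # accept a low-value job only if capacity exceeds the expected
        # number of future high-value arrivals
        cap -= 1; reward += 1

# hindsight optimum: take all high values (up to B), fill the rest with lows
highs = vals.count(2)
opt = 2 * min(highs, B) + min(T - highs, B - min(highs, B))
regret = opt - reward
```

The online policy's acceptances are feasible in hindsight, so its regret is nonnegative, and it stays bounded as the horizon grows rather than scaling with $T$.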

A graph is said to be a (1-dimensional) expander if the second eigenvalue of its adjacency matrix is bounded away from 1, or almost-equivalently, if it has no sparse vertex cuts. There are several natural ways to generalize the notion of expansion to hypergraphs/simplicial complexes; one such way is 2-dimensional spectral expansion, in which the expansion of vertex neighborhoods (remarkably) witnesses global expansion. While 1-dimensional expansion is known to be achieved by, e.g., random regular graphs, very few constructions of sparse 2-dimensional expanders are known, and at present all are algebraic. It is an open question whether sparse 2-dimensional expanders are "abundant" and "natural" or "rare." In this talk, we'll give some evidence towards abundance: we show that the set of triangles in a random geometric graph on a high-dimensional sphere yields an expanding simplicial complex of arbitrarily small polynomial degree.
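The construction can be sketched numerically: sample points on a sphere, connect pairs whose inner product exceeds a threshold, and take the triangles of the resulting graph as the 2-dimensional faces of a simplicial complex. The dimension, threshold, and graph size below are arbitrary illustrative choices, not the paper's parameter regime.

```python
import math
import random

random.seed(0)

def sphere_point(d):
    # uniform point on the unit sphere in R^d via normalized Gaussians
    v = [random.gauss(0, 1) for _ in range(d)]
    r = math.sqrt(sum(x * x for x in v))
    return [x / r for x in v]

n, d, tau = 150, 8, 0.3
pts = [sphere_point(d) for _ in range(n)]
adj = [[False] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        if sum(a * b for a, b in zip(pts[i], pts[j])) > tau:
            adj[i][j] = adj[j][i] = True

edges = sum(adj[i][j] for i in range(n) for j in range(i + 1, n))
triangles = sum(1 for i in range(n) for j in range(i + 1, n) if adj[i][j]
                for k in range(j + 1, n) if adj[i][k] and adj[j][k])
```

Because nearby points cluster geometrically, such graphs contain many triangles, which is what makes the triangle complex a natural candidate 2-dimensional object.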

Sparse linear regression is a fundamental problem in high-dimensional statistics, but strikingly little is known about how to solve it efficiently without restrictive conditions on the design matrix. This talk will focus on the (correlated) random design setting, where the covariates are independently drawn from a multivariate Gaussian and the ground-truth signal is sparse. Information-theoretically, one can achieve strong error bounds with $O(k \log n)$ samples for arbitrary covariance matrices and sparsity $k$; however, no efficient algorithms are known to match these guarantees even with $o(n)$ samples in general. As for hardness, computational lower bounds are only known with worst-case design matrices. Random-design instances are known which are hard for the Lasso, but these instances can generally be solved by Lasso after a simple change of basis (i.e. preconditioning). In the talk, we will discuss new upper and lower bounds clarifying the power of preconditioning in sparse linear regression. First, we show that the preconditioned Lasso can solve a large class of sparse linear regression problems nearly optimally: it succeeds whenever the dependency structure of the covariates, in the sense of the Markov property, has low treewidth -- even if the covariance matrix is highly ill-conditioned. Second, we construct (for the first time) random-design instances which are provably hard for an optimally preconditioned Lasso. Based on joint works with Jonathan Kelner, Frederic Koehler, and Dhruv Rohatgi.
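As a minimal sketch of the random-design setting, the following runs the plain Lasso via cyclic coordinate descent (without the preconditioning step the talk is about), assuming a well-conditioned i.i.d. Gaussian design; the dimensions, sparsity, and regularization level are hypothetical.

```python
import random

random.seed(0)

n, d, lam = 200, 10, 0.1
beta_true = [0.0] * d
beta_true[1], beta_true[7] = 3.0, -2.0        # 2-sparse ground truth

X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [sum(X[i][j] * beta_true[j] for j in range(d)) + random.gauss(0, 0.1)
     for i in range(n)]

def soft(z, t):
    # soft-thresholding operator, the proximal map of the l1 penalty
    return (z - t) if z > t else (z + t) if z < -t else 0.0

beta = [0.0] * d
for _ in range(100):                           # cyclic coordinate descent
    for j in range(d):
        # partial residual excluding coordinate j
        r = [y[i] - sum(X[i][l] * beta[l] for l in range(d) if l != j)
             for i in range(n)]
        rho = sum(X[i][j] * r[i] for i in range(n)) / n
        zj = sum(X[i][j] ** 2 for i in range(n)) / n
        beta[j] = soft(rho, lam) / zj
```

On such benign designs the Lasso recovers the support directly; the talk's point is what happens when the covariance is ill-conditioned and a change of basis is (or is not) enough to restore this behavior.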

Over the past few years, Transformers have revolutionized deep learning, leading to advances in natural language processing and beyond. These models discard recurrence and convolutions in favor of "self-attention", which directly and globally models interactions within the input context. Despite their success, there is currently limited understanding of why they work. In this talk, I will present our recent results on rigorously quantifying the statistical and representational properties of Transformers, which shed light on their ability to capture long-range dependencies efficiently. First, I will show how bounded-norm self-attention layers can represent arbitrary sparse functions of the input sequence, with sample complexity scaling only logarithmically with the context length, akin to sparse regression. Subsequently, I will briefly show how this ability of self-attention to compute sparse functions, along with its ability to compute averages, can be used to construct Transformers that exactly replicate the dynamics of a recurrent model of computation of depth $T$ with only $o(T)$ depth. I will conclude the talk with experimental results on synthetic tasks based on learning Boolean functions and automata. Based on joint works with Jordan T. Ash, Ben L. Edelman, Sham M. Kakade, Akshay Krishnamurthy, Bingbin Liu, and Cyril Zhang.
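The averaging ability of self-attention is easy to see directly: with identical keys, softmax attention places uniform weight on every position, so the output is exactly the average of the value vectors. The code below is a generic scaled dot-product attention head, not the paper's construction.

```python
import math

def attention(q, K, V):
    # scaled dot-product attention for a single query vector q
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
              for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]      # numerically stable softmax
    Z = sum(w)
    w = [x / Z for x in w]
    return [sum(w[t] * V[t][j] for t in range(len(V)))
            for j in range(len(V[0]))]

# identical keys -> equal scores -> uniform weights -> average of the values
K = [[1.0, 0.0]] * 4
V = [[1.0], [2.0], [3.0], [6.0]]
out = attention([0.5, 0.5], K, V)   # -> [3.0]
```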

Abstract
Variational methods are extremely popular in the analysis of network data. Statistical guarantees obtained for these methods typically provide asymptotic normality for the problem of estimation of global model parameters under the stochastic block model. In the present work, we consider the case of networks with missing links, which is important in applications, and show that the variational approximation to the maximum likelihood estimator converges at the minimax rate. This provides the first minimax optimal and tractable estimator for the problem of parameter estimation for the stochastic block model. We complement our results with numerical studies of simulated and real networks, which confirm the advantages of this estimator over current methods.
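A minimal sketch of the missing-links setting, assuming for simplicity that the block labels are known (the variational step, which handles unknown labels, is the subject of the talk): each dyad is observed independently with some probability, and connection probabilities are estimated from observed dyads only. All parameters are illustrative.

```python
import random

random.seed(2)

n, K = 200, 2
z = [i % K for i in range(n)]             # block labels (known in this toy)
P = [[0.5, 0.1], [0.1, 0.4]]              # true connection probabilities
obs_prob = 0.7                            # each dyad observed independently

obs = {}
for i in range(n):
    for j in range(i + 1, n):
        if random.random() < obs_prob:    # link status observed
            obs[(i, j)] = 1 if random.random() < P[z[i]][z[j]] else 0

# estimate P from observed dyads only
cnt = [[0] * K for _ in range(K)]
tot = [[0] * K for _ in range(K)]
for (i, j), a in obs.items():
    b, c = sorted((z[i], z[j]))
    tot[b][c] += 1
    cnt[b][c] += a
Phat = [[cnt[b][c] / tot[b][c] if tot[b][c] else 0.0 for c in range(K)]
        for b in range(K)]
```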

Abstract
Graph neural networks (GNNs) are successful at learning representations from most types of network data but suffer from limitations in large graphs, which do not have the Euclidean structure that time and image signals have in the limit. Yet, large graphs can often be identified as being similar to each other in the sense that they share structural properties. Indeed, graphs can be grouped in families converging to a common graph limit -- the graphon. A graphon is a bounded symmetric kernel which can be interpreted as both a random graph model and a limit object of a convergent sequence of graphs. Graphs sampled from a graphon almost surely share structural properties in the limit, which implies that graphons describe families of similar graphs. We can thus expect that processing data supported on graphs associated with the same graphon should yield similar results. In this talk, I formalize this intuition by showing that the error made when transferring a GNN across two graphs in a graphon family is small when the graphs are sufficiently large. This enables large-scale graph machine learning by transference: training GNNs on moderate-scale graphs and executing them on large-scale graphs.
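The intuition that graphs sampled from the same graphon are similar can be checked directly: the edge densities of samples of growing size concentrate around the graphon's integral. The graphon $W(x,y) = xy$ below is an arbitrary example.

```python
import random

random.seed(3)

def W(x, y):
    # a simple graphon: W(x, y) = x * y
    return x * y

def edge_density(n):
    # sample latent positions, then each edge independently with prob W
    xs = [random.random() for _ in range(n)]
    m = sum(1 for i in range(n) for j in range(i + 1, n)
            if random.random() < W(xs[i], xs[j]))
    return m / (n * (n - 1) / 2)

# densities concentrate around the integral of W over [0,1]^2, here 1/4
densities = {n: edge_density(n) for n in (100, 400)}
```

The same concentration holds for richer structural statistics (homomorphism densities), which is what underlies the transference of GNN outputs across graphs in a graphon family.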

Abstract
The theory of graph limits is an important tool in understanding properties of large networks. We begin the talk with a survey of this theory, concentrating in particular on the sparse setting. We then investigate a power-law random graph model and cast it in the sparse graph limit theory framework. The distinctively different structures of the limit graph are explored in detail in the sub-critical and super-critical regimes. In the sub-critical regime, the graph is empty with high probability, and in the rare event that it is non-empty, it consists of a single edge. Contrarily, in the super-critical regime, a non-trivial random graph exists in the limit, and it serves as a newly uncovered boundary case between different types of graph convergence.

Abstract
The goal of this talk is to describe probabilistic approaches to three major problems for dynamic networks, each of which is intricately connected to long-range dependence in the evolution of such models:
1. Nonparametric change point detection: Consider models of growing networks that evolve via new vertices attaching to the pre-existing network according to one attachment function $f$ until the system grows to size $\tau(n) < n$, when new vertices switch their behavior to a different function $g$ until the system reaches size $n$. The goal is to estimate the change point given observations of the network over time, with no knowledge of the functions driving the dynamics. We will describe nonparametric estimators for such problems.
2. Detecting the initial seed which resulted in the current state of the network: Imagine observing a static time slice of the network after some large time $n$, starting from an initial seed. Suppose all one gets to see is the current topology of the network (without any label or age information). Developing provably efficient algorithms for estimating the initial seed has inspired intense activity over the last few years in the probability community. We will describe recent developments in addressing such questions, including robustness results such as the fixation of so-called hub vertices as time evolves.
3. Co-evolving networks: models of networks where dynamics on the network (directed random walks towards the root) modify the structure of the network, which then impacts the behavior of subsequent dynamics. We will describe non-trivial phase transitions for condensation around the root and its connection to quasi-stationary distributions of one-dimensional random walks.
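A minimal simulation of the first model, with a hypothetical change point from uniform attachment ($f \equiv 1$) to preferential attachment ($g(d) = d$); the sizes and functions are illustrative, and no estimation is attempted here.

```python
import random

random.seed(5)

def grow(n, tau, f, g):
    # grow a random tree on n vertices: vertex t attaches to an existing
    # vertex v with probability proportional to f(deg(v)) for t < tau,
    # and proportional to g(deg(v)) afterwards
    deg = [1, 1]                  # start from the single edge 0 - 1
    parent = [None, 0]
    for t in range(2, n):
        h = f if t < tau else g
        w = [h(d) for d in deg]
        r = random.random() * sum(w)
        acc, v = 0.0, 0
        for u, wu in enumerate(w):
            acc += wu
            if r <= acc:
                v = u
                break
        parent.append(v)
        deg[v] += 1
        deg.append(1)
    return deg, parent

deg, parent = grow(2000, 1000, lambda d: 1.0, lambda d: float(d))
```

The change-point estimation problem is to recover `tau` from snapshots of such a network without knowing `f` or `g`.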

Abstract
We consider interacting particle systems on suitable convergent sequences of sparse (or heterogeneous) graphs and show that the limiting dynamics of the associated neighborhood empirical measure process (the so-called hydrodynamic limit) can be autonomously characterized in terms of a non-Markovian process. We then describe Markovian approximations to the latter and provide examples where they are exact. This includes joint work with G. Cocomello and A. Ganguly.

Abstract
We consider an inhomogeneous Erd\H{o}s-R\'enyi random graph $G_N$ with vertex set $[N] = \{1,\dots,N\}$ for which the pair of vertices $i,j \in [N]$, $i\neq j$, is connected by an edge with probability $r(\tfrac{i}{N},\tfrac{j}{N})$, independently of other pairs of vertices. Here, $r\colon\,[0,1]^2 \to (0,1)$ is a symmetric function that plays the role of a reference graphon.
(1) Let $\lambda_N$ be the maximal eigenvalue of the \emph{adjacency matrix} of $G_N$. It is known that $\lambda_N/N$ satisfies an LDP with rate $N$ as $N \to \infty$. The associated rate function $\psi_r$ is given by a variational formula that involves the rate function $I_r$ of a large deviation principle on graphon space. We analyse this variational formula in order to identify the basic properties of $\psi_r$, especially when the reference graphon $r$ is of rank 1.
(2) Let $\hat\lambda_N$ be the maximal eigenvalue of the \emph{Laplacian matrix} of $G_N$. We show that $\hat\lambda_N/N$ satisfies a downward LDP with rate $\binom{N}{2}$ and an upward LDP with rate $N$. The associated rate functions $\hat\psi_r$ and $\hat\psi^*_r$ are those of the downward LDP and upward LDP for the maximal degree in the graph. We identify the basic properties of both $\hat\psi_r$ and $\hat\psi^*_r$.
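The rank-1 case can be checked numerically: for a reference graphon $r(x,y) = s(x)s(y)$, the maximal adjacency eigenvalue scaled by $N$ concentrates around $\int_0^1 s(x)^2\,dx$; the large deviations in the talk concern fluctuations away from this typical value. The choice of $s$ and $N$ below is illustrative.

```python
import random

random.seed(7)

N = 300
def s(x):
    return 0.5 + 0.4 * x        # rank-1 reference graphon r(x,y) = s(x)s(y)

# sample the inhomogeneous Erdos-Renyi graph G_N
A = [[0] * N for _ in range(N)]
for i in range(N):
    for j in range(i + 1, N):
        if random.random() < s((i + 1) / N) * s((j + 1) / N):
            A[i][j] = A[j][i] = 1

# power iteration for the top adjacency eigenvalue
v = [1.0] * N
for _ in range(50):
    w = [sum(A[i][j] * v[j] for j in range(N)) for i in range(N)]
    m = max(w)
    v = [x / m for x in w]
lam = sum(v[i] * sum(A[i][j] * v[j] for j in range(N)) for i in range(N)) \
      / sum(x * x for x in v)

# typically lam / N is close to the kernel's top eigenvalue,
# integral of s(x)^2 over [0,1] = 0.25 + 0.2 + 0.16/3 = 0.5033...
```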