We consider a class of supervised learning problems whereby we are given n data points (y_i,x_i), with x_i a d-dimensional dfeature factor, y_i a response, and the model is parametrized by a vector of dimension kd. We consider the high-dimensional asymptotics in which n,d diverge, with n/d and k of order one. As a special case, this class of models includes neural networks with k hidden neurons.
I will present two sets of results:
1. Universality of certain properties of empirical risk minimizers with respect to the distribution of the feature vectors x_i.
2. A sharp asymptotic characterization of gradient flow in terms of a one-dimensional stochastic process.
[Based on joint work with Michael Celentano, Chen Cheng, Basil Saeed]

We consider interpolation learning in high-dimensional linear regression with Gaussian data, and prove a generic uniform convergence guarantee on the generalization error of interpolators in an arbitrary hypothesis class in terms of the class's Gaussian width. Applying the generic bound to Euclidean norm balls recovers the consistency result of Bartlett et al. (2020) for minimum-norm interpolators, and confirms a prediction of Zhou et al. (2020) for near-minimal-norm interpolators in the special case of Gaussian data. We demonstrate the generality of the bound by applying it to the simplex, obtaining a novel consistency result for minimum l1-norm interpolators (basis pursuit). Our results show how norm-based generalization bounds can explain and be used to analyze benign overfitting, at least in some settings. Joint work with Lijia Zhou, Danica Sutherland, and Nathan Srebro.

Classical matrix concentration inequalities are sharp up to a logarithmic factor. This logarithmic factor is necessary in the commutative case but unnecessary in many classical noncommutative cases. We will present some matrix concentration results that are sharp in many cases, where we overcome this logarithmic factor by using an easily computable quantity that captures noncommutativity. Joint work with Afonso Bandeira and Ramon van Handel. Paper: https://arxiv.org/abs/2108.06312

Most theoretical guarantees for stochastic gradient descent (SGD) assume that the iterates are averaged, that the stepsizes are decreasing, and/or that the objective is regularized. However, practice shows that these tricks are less necessary than theoretically expected. I will present an analysis of SGD that uses none of these tricks: we analyze the behavior of the last iterate of fixed step-size, non-regularized SGD. Our results apply for kernel regression, i.e., infinite-dimensional linear regression. As a special case, we analyze an online algorithm for estimating a real function on the unit interval from the observation of its value at randomly sampled points.

We consider a binary classification problem where the data comes from a mixture of two rotationally symmetric log-concave distributions. We analyze the dynamics of a nonconvex gradient-based self-training algorithm that utilizes a pseudolabeler to generate labels for unlabeled data and updates the pseudolabeler based on a "supervised" loss that treats the pseoudolabels as if they were ground truth-labels. We show that provided the initial pseudolabeler has classification error smaller than some absolute constant, then self-training produces classifiers with error at most epsilon greater than that of the Bayes-optimal classifier using O(d/epsilon^2) unlabeled examples. That is, self-training converts weak learners to strong learners using only unlabeled examples. We show that gradient descent on the logistic loss with O(d) labeled examples (i.e., independent of epsilon) suffices to produce a pseudolabeler such that the aforementioned results hold.

Graph neural networks (GNNs) generalize convolutional neural networks (CNNs) by using graph convolutions that enable information extraction from non-Euclidian domains, e.g., network data. These graph convolutions combine information from adjacent nodes using coefficients that are shared across all nodes. Since these coefficients do not depend on the graph, one can envision using the same coefficients to define a GNN on a different graph. In this talk, I will tackle this problem by introducing graphon NNs as limit objects of sequences of GNNs and characterizing the difference between the output of a GNN and its limit graphon-NN. This bound vanishes as the number of nodes grows as long as the graph convolutional filters are bandlimited in the graph spectral domain. This establishes a tradeoff between discriminability and transferability of GNNs and sheds light on the effect of training using smaller (possibly corrupted) graph convolutions.

It is known that arbitrary poly-size neural networks trained by GD/SGD can learn in class in SQ/PAC. This is however not expected to hold for more regular architectures and initializations. Recently, the staircase property emerged as a condition that seems both necessary and sufficient for certain regular networks to learn with high accuracy, with the positive result established for sparse homogeneous initializations. In this talk, we show that standard two-layer architectures can also learn staircases with features being learned over time. It is also shown that kernels cannot learn staircases of growing degree. Joint work with Enric Boix-Adsera and Theodor Misiakiewicz.

For different parameterizations (mappings from parameters to predictors), we study the regularization cost in predictor space induced by l_2 regularization on the parameters (weights). We focus on linear neural networks as parameterizations of linear predictors and identify the representation cost of certain sparse linear ConvNets and residual networks. In order to get a better understanding of how the architecture and parameterization affect the representation cost, we also study the reverse problem, identifying which regularizers on linear predictors (e.g., l_p quasi-norms, group quasi-norms, the k-support-norm, elastic net) can be the representation cost induced by simple l_2 regularization, and designing the parameterizations that do so.

In this short talk, I will share some recent progress on the hardness of learning shallow RELU neural networks (Relu-NN) and polynomially small adversarial noise. We will present a result that efficiently learning an 1-hidden layer Relu-NN under Gaussian input and adversarial noise is "cryptographically hard", in the sense that it implies a polynomial-time quantum algorithm for the worst-case shortest vector problem.

This lecture will introduce Bayesian networks and their causal interpretation as causal graphical models, d-separation, the do-calculus, and the Shpitser-Pearl ID algorithm. We'll start by introducing Bayesian networks, causal graphical models, and interventions. We'll then show that two Bayesian networks with the same skeletons and v-structures represent the same conditional independence assumptions and prove that the d-separations present in any Bayesian network exhaust all the conditional independence assumptions that are guaranteed to hold in any distribution that factorizes according to that network. We then turn to causal models and introduce the do-calculus. We show the soundness of each of the rules of the do-calculus. Finally, we describe the Shpitser-Pearl algorithm for identifying causal effects in semi-Markovian models and, time-permitting, prove that the ID algorithm is complete for identifying causal effects; that is, a causal effect is identifiable if and only if the ID algorithm terminates successfully.

This lecture will introduce Bayesian networks and their causal interpretation as causal graphical models, d-separation, the do-calculus, and the Shpitser-Pearl ID algorithm. We'll start by introducing Bayesian networks, causal graphical models, and interventions. We'll then show that two Bayesian networks with the same skeletons and v-structures represent the same conditional independence assumptions and prove that the d-separations present in any Bayesian network exhaust all the conditional independence assumptions that are guaranteed to hold in any distribution that factorizes according to that network. We then turn to causal models and introduce the do-calculus. We show the soundness of each of the rules of the do-calculus. Finally, we describe the Shpitser-Pearl algorithm for identifying causal effects in semi-Markovian models and, time-permitting, prove that the ID algorithm is complete for identifying causal effects; that is, a causal effect is identifiable if and only if the ID algorithm terminates successfully.