It is known that arbitrary poly-size neural networks trained by GD/SGD can learn any class that is learnable in SQ/PAC. This is, however, not expected to hold for more regular architectures and initializations. Recently, the staircase property emerged as a condition that seems both necessary and sufficient for certain regular networks to learn with high accuracy, with the positive result established for sparse homogeneous initializations. In this talk, we show that standard two-layer architectures can also learn staircases, with features being learned over time. It is also shown that kernels cannot learn staircases of growing degree. Joint work with Enric Boix-Adsera and Theodor Misiakiewicz.
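For intuition, the canonical example of a staircase function (recalled here only as an illustration; the precise class considered in the talk may differ) is a sum of nested monomials on the Boolean hypercube, where each term extends the previous one by a single coordinate:

```latex
S_d(x) \;=\; x_1 \;+\; x_1 x_2 \;+\; x_1 x_2 x_3 \;+\; \cdots \;+\; x_1 x_2 \cdots x_d,
\qquad x \in \{\pm 1\}^d .
```

Because each new coordinate only becomes informative once the earlier ones have been picked up, such targets lend themselves to being learned incrementally, which matches the "features learned over time" picture above.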
Classical matrix concentration inequalities are sharp up to a logarithmic factor. This logarithmic factor is necessary in the commutative case but unnecessary in many classical noncommutative cases. We will present some matrix concentration results that are sharp in many cases, where we overcome this logarithmic factor by using an easily computable quantity that captures noncommutativity. Joint work with Afonso Bandeira and Ramon van Handel. Paper: https://arxiv.org/abs/2108.06312
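For reference, a standard form of the classical bound (stated here as textbook background, not as the talk's new result) concerns a Gaussian matrix series X = sum_i g_i A_i with self-adjoint d x d coefficients A_i and i.i.d. standard Gaussians g_i:

```latex
\sigma(X) := \Big\| \sum_i A_i^2 \Big\|^{1/2},
\qquad
c\,\sigma(X) \;\le\; \mathbb{E}\,\|X\| \;\le\; C\,\sigma(X)\,\sqrt{\log d}.
```

The sqrt(log d) factor on the right is the logarithmic gap in question: it is needed when the A_i commute (e.g., diagonal coefficients, where the norm is a maximum of d Gaussians), but it can be removed in many noncommutative situations, which is what the easily computable quantity capturing noncommutativity makes precise.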
Most theoretical guarantees for stochastic gradient descent (SGD) assume that the iterates are averaged, that the stepsizes are decreasing, and/or that the objective is regularized. However, practice shows that these tricks are less necessary than theoretically expected. I will present an analysis of SGD that uses none of these tricks: we analyze the behavior of the last iterate of fixed-stepsize, non-regularized SGD. Our results apply to kernel regression, i.e., infinite-dimensional linear regression. As a special case, we analyze an online algorithm for estimating a real function on the unit interval from observations of its values at randomly sampled points.
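As a concrete (and deliberately simplified) illustration of that last setting, here is a minimal sketch of last-iterate, fixed-stepsize, unregularized kernel SGD on the unit interval; the Gaussian kernel, the target function, the noise level, and the stepsize are all illustrative assumptions, not the setting analyzed in the talk.

```python
import numpy as np

# Minimal sketch: last-iterate, fixed-stepsize, unregularized kernel SGD for
# estimating a real function on [0, 1] from noisy values at random points.
# Kernel, target, noise level and stepsize below are illustrative assumptions.

rng = np.random.default_rng(0)
kernel = lambda x, z: np.exp(-((x - z) ** 2) / (2 * 0.05 ** 2))  # assumed RKHS kernel
f_star = lambda x: np.sin(2 * np.pi * x)                         # unknown target
gamma, T = 0.5, 1000                                             # fixed stepsize, #samples

xs, alphas = np.empty(T), np.empty(T)  # iterate: f_t(x) = sum_i alphas[i] * kernel(xs[i], x)
for t in range(T):
    x_t = rng.uniform(0.0, 1.0)
    y_t = f_star(x_t) + 0.1 * rng.standard_normal()
    pred = alphas[:t] @ kernel(xs[:t], x_t)          # last iterate's prediction at x_t
    # SGD step on the squared loss: f <- f - gamma * (f(x_t) - y_t) * K(x_t, .)
    xs[t], alphas[t] = x_t, -gamma * (pred - y_t)

# evaluate the last iterate (no averaging, no regularization) on a grid
grid = np.linspace(0.0, 1.0, 200)
f_hat = np.array([alphas @ kernel(xs, g) for g in grid])
print("mean squared error on the grid:", np.mean((f_hat - f_star(grid)) ** 2))
```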
We consider a binary classification problem where the data comes from a mixture of two rotationally symmetric log-concave distributions. We analyze the dynamics of a nonconvex gradient-based self-training algorithm that utilizes a pseudolabeler to generate labels for unlabeled data and updates the pseudolabeler based on a "supervised" loss that treats the pseudolabels as if they were ground-truth labels. We show that, provided the initial pseudolabeler has classification error smaller than some absolute constant, self-training produces classifiers with error at most epsilon greater than that of the Bayes-optimal classifier using O(d/epsilon^2) unlabeled examples. That is, self-training converts weak learners to strong learners using only unlabeled examples. We show that gradient descent on the logistic loss with O(d) labeled examples (i.e., independent of epsilon) suffices to produce a pseudolabeler such that the aforementioned results hold.
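A minimal sketch of the self-training loop described above, under illustrative assumptions (a spherical Gaussian mixture, a linear pseudolabeler, and full-batch gradient steps on the logistic loss with pseudolabels treated as ground truth); it is not the exact algorithm or regime analyzed in the talk.

```python
import numpy as np
from scipy.special import expit  # logistic sigmoid

# Minimal sketch of self-training with a linear pseudolabeler on a spherical
# Gaussian mixture; the data model and all constants are illustrative assumptions.

rng = np.random.default_rng(0)
d, n_unlabeled = 20, 5000
mu = 2.0 * np.ones(d) / np.sqrt(d)                 # class mean direction (assumed)
y_true = rng.choice([-1.0, 1.0], size=n_unlabeled) # hidden labels, never used for training
X = y_true[:, None] * mu + rng.standard_normal((n_unlabeled, d))

w = mu + 0.5 * rng.standard_normal(d)              # weak (but non-trivial) initial pseudolabeler
eta = 0.1
for _ in range(200):
    pseudo = np.sign(X @ w)                        # pseudolabels from the current classifier
    margins = pseudo * (X @ w)
    # gradient of the logistic loss, treating the pseudolabels as ground-truth labels
    grad = -(X * (pseudo * expit(-margins))[:, None]).mean(axis=0)
    w -= eta * grad

print("error of self-trained classifier:", np.mean(np.sign(X @ w) != y_true))
```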
Graph neural networks (GNNs) generalize convolutional neural networks (CNNs) by using graph convolutions that enable information extraction from non-Euclidean domains, e.g., network data. These graph convolutions combine information from adjacent nodes using coefficients that are shared across all nodes. Since these coefficients do not depend on the graph, one can envision using the same coefficients to define a GNN on a different graph. In this talk, I will tackle this problem by introducing graphon NNs as limit objects of sequences of GNNs and characterizing the difference between the output of a GNN and that of its limit graphon NN. This bound vanishes as the number of nodes grows, as long as the graph convolutional filters are bandlimited in the graph spectral domain. This establishes a tradeoff between discriminability and transferability of GNNs and sheds light on the effect of training GNNs on smaller (possibly corrupted) graphs.
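To make the "same coefficients, different graph" point concrete, here is a minimal sketch of a graph convolutional filter whose taps are shared across nodes and independent of the graph; the random graphs, the normalization, and the filter taps are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of a graph convolutional filter: the coefficients h[k] are shared
# across all nodes and do not depend on the graph, so the same filter can be
# applied to graphs of different sizes.

def graph_filter(S, x, h):
    """Apply y = sum_k h[k] * S^k x, where S is a graph shift operator
    (e.g., a normalized adjacency matrix) and x is a graph signal."""
    y = np.zeros_like(x)
    Skx = x.copy()
    for hk in h:
        y += hk * Skx
        Skx = S @ Skx          # one more hop of neighborhood aggregation
    return y

rng = np.random.default_rng(0)
h = np.array([1.0, 0.5, 0.25])                     # filter taps, shared across nodes

for n in (50, 500):                                # same coefficients, different graphs
    A = (rng.random((n, n)) < 0.1).astype(float)
    A = np.triu(A, 1); A = A + A.T                 # symmetric adjacency, no self-loops
    S = A / max(A.sum(axis=1).max(), 1.0)          # simple normalization (assumed)
    x = rng.standard_normal(n)
    print(n, graph_filter(S, x, h)[:3])
```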
We consider interpolation learning in high-dimensional linear regression with Gaussian data, and prove a generic uniform convergence guarantee on the generalization error of interpolators in an arbitrary hypothesis class in terms of the class's Gaussian width. Applying the generic bound to Euclidean norm balls recovers the consistency result of Bartlett et al. (2020) for minimum-norm interpolators, and confirms a prediction of Zhou et al. (2020) for near-minimal-norm interpolators in the special case of Gaussian data. We demonstrate the generality of the bound by applying it to the simplex, obtaining a novel consistency result for minimum l1-norm interpolators (basis pursuit). Our results show how norm-based generalization bounds can explain and be used to analyze benign overfitting, at least in some settings. Joint work with Lijia Zhou, Danica Sutherland, and Nathan Srebro.
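For concreteness, the two interpolators mentioned above can be written down directly; the sketch below computes both on synthetic Gaussian data (an illustrative setup chosen for brevity, not the paper's exact regime), using the closed form for the minimum l2-norm interpolator and a linear program for basis pursuit.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative sketch: minimum l2-norm and minimum l1-norm (basis pursuit)
# interpolators on synthetic Gaussian data; dimensions, noise and sparsity
# below are assumptions made for the example.

rng = np.random.default_rng(0)
n, d = 50, 200                                     # overparameterized: d > n
X = rng.standard_normal((n, d))
beta_star = np.zeros(d); beta_star[:5] = 1.0       # sparse ground truth (assumed)
y = X @ beta_star + 0.1 * rng.standard_normal(n)

# minimum l2-norm interpolator: beta = X^T (X X^T)^{-1} y
beta_l2 = X.T @ np.linalg.solve(X @ X.T, y)

# minimum l1-norm interpolator (basis pursuit), as an LP in (beta_plus, beta_minus):
# minimize sum(beta_plus + beta_minus) s.t. X (beta_plus - beta_minus) = y, both >= 0
c = np.ones(2 * d)
A_eq = np.hstack([X, -X])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * d))
beta_l1 = res.x[:d] - res.x[d:]

print("both interpolate:", np.allclose(X @ beta_l2, y), np.allclose(X @ beta_l1, y, atol=1e-6))
print("l2 norms:", np.linalg.norm(beta_l2), np.linalg.norm(beta_l1))
```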
Symmetries play a unifying role in physics and many other sciences. In deep learning, symmetries have been incorporated into neural networks through the concept of equivariance. One of the major benefits is that equivariance reduces the number of parameters through parameter sharing and, as such, allows learning from less data. In this talk I will ask the question: can equivariance also help in RL? Besides the obvious idea of using equivariant value functions, we explore the idea of deep equivariant policies. We make a connection between equivariance and MDP homomorphisms, and generalize to distributed multi-agent settings. Joint work with Elise van der Pol (main contributor), Herke van Hoof and Frans Oliehoek.
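As a generic illustration of what an equivariant policy constraint means (a group-averaging sketch under an assumed C4 rotation symmetry; this is not the MDP-homomorphism construction from the talk), rotating the observation should permute the policy's action preferences accordingly:

```python
import numpy as np

# Generic sketch: build a C4 (90-degree rotation) equivariant policy from an
# arbitrary base network by group averaging, then check equivariance numerically.
# The 5x5 grid observation, the 4 directional actions, and the linear base
# network are illustrative assumptions.

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 25))                  # arbitrary (non-equivariant) base network

def base_policy(obs):                             # obs: 5x5 grid -> logits over 4 actions
    return W @ obs.ravel()

def equivariant_policy(obs):
    # group averaging: pi(obs) = (1/4) * sum_k rho_out(k)^{-1} base(rho_in(k) obs),
    # where rho_in rotates the grid and rho_out cyclically shifts the action logits
    logits = np.zeros(4)
    for k in range(4):
        logits += np.roll(base_policy(np.rot90(obs, k)), -k)
    return logits / 4.0

obs = rng.standard_normal((5, 5))
for k in range(4):
    lhs = equivariant_policy(np.rot90(obs, k))    # rotating the observation...
    rhs = np.roll(equivariant_policy(obs), k)     # ...permutes the action logits
    assert np.allclose(lhs, rhs)
print("equivariance check passed")
```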
We report on the geometric structure of symmetry adapted PSD cones and symmetry adapted Gram spectrahedra of symmetric polynomials. We determine the dimension of symmetry adapted PSD cones, describe their extreme rays, and discuss the structure of their matrix representations. We also focus on symmetry adapted Gram spectrahedra of symmetric binary forms, quadrics, ternary quartics and sextics. In particular, for symmetric binary forms we characterize the rank-two extreme points of these spectrahedra, and we report what we know about the facial structure of the same spectrahedra. The talk will be based on two collaborations, one with Alex Heaton and Isabelle Shankar, and another one with Matthew Heid.
Tensor network states form a variational ansatz class widely used in the study of quantum many-body systems. Geometrically, these states form an algebraic variety of tensors with rich representation theoretic structure. It is known that tensors on the "boundary" of this variety can provide more efficient representations for states of physical interest, but the pathological geometric properties of the boundary make it difficult to extend the classical optimization methods. In recent work, we introduced a new ansatz class which includes states at the boundary of the tensor network variety. I will present some of the geometric features of this class and explain how it can be used in the variational approach. This is based on joint work with M. Christandl, D. Stilck-Franca and A. Werner.
Geometrically, a linear program is a polyhedron together with an orientation of its graph. A simplex method determines a path from any given starting vertex to the sink. The centerpiece of any simplex algorithm is the pivot rule that successively selects outgoing edges along the path. We introduce normalized-weight pivot rules, a class of pivot rules that are memoryless, that subsume many of the most-used pivot rules, and that can be parametrized in a natural continuous manner. We introduce two polytopes that capture the behavior of normalized-weight pivot rules on polytopes and linear programs. On the one hand, this gives a new perspective on the performance of pivot rules. On the other hand, our constructions generalize classical constructions (e.g. monotone path polytopes) and objects (e.g. permutahedra, associahedra, multiplihedra) from geometric combinatorics. This is joint work with Alex Black, Jesus De Loera, and Niklas Lütjeharms.