Is Overfitting Actually Benign? On the Consistency of Interpolating Methods
Preetum Nakkiran (UCSD)
This talk will be very informal: I will discuss ongoing research with open questions and partial results. We study the asymptotic consistency of modern interpolating methods, including deep networks, in both classification and regression settings. That is, we consider learning methods which scale the data size while simultaneously scaling the model size, in a way which always interpolates the train set. We present empirical evidence that, perhaps contrary to intuitions in theory, many natural interpolating learning methods are *inconsistent* for a wide variety of distributions. That is, they do not approach Bayes-optimality even in the limit of infinite data. The message is that, in settings with nonzero Bayes risk, overfitting is not benign: interpolating the noise significantly harms the classifier, to the point of preventing consistency. This work is motivated by: (1) understanding differences between the overparameterized and underparameterized regimes, (2) guiding theory towards more "realistic" assumptions that capture deep learning practice, and (3) understanding common structure shared by "natural" interpolating methods. Based on joint work with Neil Mallinar, Amirhesam Abedsoltan, Gil Kur, and Misha Belkin; it is a follow-up to the paper "Distributional Generalization", joint with Yamini Bansal.
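As a rough, self-contained illustration of the kind of inconsistency at stake (my own toy example, not taken from the talk): a 1-nearest-neighbor rule interpolates its noisy training labels, and its test error stays bounded away from the Bayes risk no matter how much data it sees. All distributions and numbers below are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
noise = 0.2                                       # label-flip probability => Bayes risk 0.2

def sample(n):
    x = rng.uniform(0, 1, size=(n, 1))
    y_clean = (x[:, 0] > 0.5).astype(int)
    flip = rng.uniform(size=n) < noise
    return x, np.where(flip, 1 - y_clean, y_clean)

def one_nn_predict(x_train, y_train, x_test):
    # the 1-NN rule fits the training set exactly, i.e., it interpolates the noise
    idx = np.abs(x_test[:, None, 0] - x_train[None, :, 0]).argmin(axis=1)
    return y_train[idx]

for n in (200, 2000, 20000):
    x_tr, y_tr = sample(n)
    x_te, y_te = sample(500)
    err = np.mean(one_nn_predict(x_tr, y_tr, x_te) != y_te)
    print(n, round(err, 3))                       # hovers near 2*0.2*0.8 = 0.32, not the Bayes risk 0.2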
Subdiffusion and ergodicity breaking in systems with emergent or microscopic dipole-moment conservation
Alan Morningstar (Citadel LLC)
I will first give a brief overview of my research in the field of out-of-equilibrium quantum many-body physics, ranging from the theory of many-body localization, to the recent application of Tensor Processing Units for accelerating simulations of quantum dynamics. I’ll then focus on (1) the experimental observation and theoretical explanation of subdiffusive dynamics in a “tilted” Fermi-Hubbard system [PRX 10, 011042 (2020)], and (2) a “freezing” phase transition between weak and strong ergodicity breaking in systems with particles that are immobile by themselves, but undergo coordinated pair hopping [PRB 101, 214205 (2020)]. These topics contain the common thread of either an emergent or microscopic conservation of the dipole moment (center of mass of the particle distribution), and I will provide simple pictures for how this leads to the subdiffusion and ergodicity breaking.
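To make the "immobile alone, mobile in pairs" mechanism concrete, here is a toy Monte Carlo sketch of a dipole-moment-conserving constraint (my own illustration, not the specific models of the cited papers): the only allowed local move exchanges the occupation patterns 0110 and 1001, which conserves both particle number and the center of mass, so an isolated particle can never hop.

import numpy as np

rng = np.random.default_rng(1)

def run(cfg, steps):
    L = len(cfg)
    moved = np.zeros(L, dtype=bool)
    for _ in range(steps):
        i = rng.integers(0, L - 3)               # pick a random 4-site window
        w = cfg[i:i + 4]
        if np.array_equal(w, [0, 1, 1, 0]) or np.array_equal(w, [1, 0, 0, 1]):
            cfg[i:i + 4] = 1 - w                 # dipole-conserving pair move
            moved[i:i + 4] = True
    return moved.mean()                          # fraction of sites that ever changed

L = 200
for filling in (0.1, 0.5):
    cfg = (rng.uniform(size=L) < filling).astype(int)
    print(filling, run(cfg, steps=20000))        # dilute chains stay mostly frozen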
Universality of Neural Networks
Dan Mikulincer (MIT)
It is well known that, at a random initialization, as their width approaches infinity, neural networks can be well approximated by Gaussian processes. We quantify this phenomenon by providing non-asymptotic convergence rates in the space of continuous functions. In the process, we study the Central Limit Theorem in high and infinite dimensions, as well as anti-concentration properties of polynomials in random variables.
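A quick numerical check of the phenomenon (my own toy setup, not the talk's construction): sample many independent width-m two-layer ReLU networks at standard Gaussian initialization with 1/sqrt(m) output scaling, evaluate them at a fixed input, and watch the output distribution become more Gaussian as m grows.

import numpy as np

rng = np.random.default_rng(0)
d, n_nets = 10, 10000
x = rng.normal(size=d) / np.sqrt(d)              # one fixed input

def outputs(width):
    # independent random networks: output = (1/sqrt(m)) * sum_j a_j * relu(w_j . x)
    W = rng.normal(size=(n_nets, width, d))
    a = rng.normal(size=(n_nets, width))
    return (a * np.maximum(W @ x, 0.0)).sum(axis=1) / np.sqrt(width)

for width in (2, 10, 100):
    out = outputs(width)
    excess_kurtosis = np.mean((out - out.mean()) ** 4) / out.var() ** 2 - 3.0
    print(width, round(excess_kurtosis, 3))      # drifts toward 0, a crude signature of Gaussianity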
The Spectrum of Nonlinear Random Matrices for Ultra-Wide Neural Networks
Yizhe Zhu (University of California, Irvine)
We obtain the limiting spectral distributions of empirical conjugate kernel and neural tangent kernel matrices for two-layer neural networks with deterministic data and random weights. When the width of the network grows faster than the size of the dataset, a deformed semicircle law appears. In this regime, we also calculate the asymptotic test and training errors for random feature regression. Joint work with Zhichao Wang (UCSD).
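For intuition, here is a small simulation sketch (my own toy normalization; the paper's precise centering and scaling differ) that computes the empirical spectrum of a centered, rescaled conjugate kernel matrix in the ultra-wide regime m >> n.

import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 100, 10000                        # m >> n: the "ultra-wide" regime
X = rng.normal(size=(d, n)) / np.sqrt(d)         # data, held fixed ("deterministic")

def conjugate_kernel():
    W = rng.normal(size=(m, d))                  # fresh random first-layer weights
    F = np.maximum(W @ X, 0.0)                   # ReLU features, shape (m, n)
    return F.T @ F / m                           # n x n empirical conjugate kernel

K = conjugate_kernel()
K_mean = sum(conjugate_kernel() for _ in range(30)) / 30   # crude Monte Carlo proxy for E[K]
eigs = np.linalg.eigvalsh(np.sqrt(m / n) * (K - K_mean))
print(float(eigs.min()), float(eigs.max()))      # histogram eigs to compare with a (deformed) semicircle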
Exact Asymptotics and Universality for Gradient Flows and Empirical Risk Minimizers
Andrea Montanari (Stanford University)
We consider a class of supervised learning problems in which we are given n data points (y_i, x_i), with x_i a d-dimensional feature vector, y_i a response, and the model parametrized by a vector of dimension kd. We consider the high-dimensional asymptotics in which n, d diverge, with n/d and k of order one. As a special case, this class of models includes neural networks with k hidden neurons. I will present two sets of results: 1. Universality of certain properties of empirical risk minimizers with respect to the distribution of the feature vectors x_i. 2. A sharp asymptotic characterization of gradient flow in terms of a one-dimensional stochastic process. [Based on joint work with Michael Celentano, Chen Cheng, and Basil Saeed]
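One concrete member of this model class (my notation; the talk's setup may differ in details) is a two-layer network with k hidden neurons and fixed second layer and activation, fit by empirical risk minimization in the proportional regime:
\[
f(x;\Theta) \;=\; \sum_{j=1}^{k} a_j\,\sigma(\langle w_j, x\rangle),
\qquad \Theta = (w_1,\dots,w_k)\in\mathbb{R}^{kd},
\]
\[
\hat\Theta \;=\; \arg\min_{\Theta}\ \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i,\, f(x_i;\Theta)\big),
\qquad n,d\to\infty,\quad n/d\to\delta\in(0,\infty),\quad k=O(1),
\]
where \(\ell\) is a loss function and the feature vectors \(x_i\) follow some high-dimensional distribution; universality refers to the insensitivity of the minimizer's properties to that distribution.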
Paleo-Detectors - Digging for Dark Matter and Neutrinos
Sebastian Baum (Nordita - Nordic Institute for Theoretical Physics)
Paleo-detectors are natural minerals which record damage tracks from nuclear recoils over geological timescales. Minerals commonly found on Earth are as old as a billion years, and modern microscopy techniques may allow damage tracks to be reconstructed with nanometer-scale spatial resolution. Thus, paleo-detectors could achieve keV recoil energy thresholds with exposures comparable to a kiloton-scale conventional "real-time" detector. In this talk, I will discuss the potential of paleo-detectors for the direct detection of dark matter as well as for detecting low-energy neutrinos, such as those emitted by core-collapse supernovae or our Sun. Furthermore, the age of the minerals provides the ability to look back across gigayear timescales, giving paleo-detectors the unique ability to probe changes in the cosmic ray rate or the galactic supernova rate over such timescales, as well as dark matter substructure Earth might have encountered during its past few trips around our Galaxy.
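As a rough back-of-the-envelope for why this works (illustrative numbers of my choosing, not figures from the talk): exposure is target mass times integration time, so even a small mineral sample accumulates an enormous exposure over geological time,
\[
\varepsilon \;=\; M \times T \;\approx\; 0.1\,\mathrm{kg} \times 10^{9}\,\mathrm{yr}
\;=\; 10^{8}\,\mathrm{kg\,yr} \;=\; 100\,\mathrm{kt\,yr},
\]
which matches what a 1-kiloton real-time detector would collect only after a century of operation.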
Learning Staircases
Emmanuel Abbe (École polytechnique fédérale de Lausanne), Enric Boix (MIT), Theodor Misiakiewicz (Stanford University)
It is known that arbitrary poly-size neural networks trained by GD/SGD can learn any function class learnable in SQ/PAC. This is, however, not expected to hold for more regular architectures and initializations. Recently, the staircase property emerged as a condition that seems both necessary and sufficient for certain regular networks to learn with high accuracy, with the positive result established for sparse homogeneous initializations. In this talk, we show that standard two-layer architectures can also learn staircases, with features being learned over time. It is also shown that kernels cannot learn staircases of growing degree. Joint work with Enric Boix-Adsera and Theodor Misiakiewicz.
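As a concrete example of the objects in play (a standard example in this line of work; notation mine), the degree-d staircase function on the hypercube is
\[
f(x) \;=\; x_1 + x_1x_2 + x_1x_2x_3 + \cdots + x_1x_2\cdots x_d,
\qquad x\in\{-1,+1\}^{n},
\]
where each monomial extends the previous one by a single new coordinate, so the lower-degree terms can guide gradient-based training toward the higher-degree ones.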
Large Scale Structure Beyond the 2-Point Function
Oliver Philcox (Columbia University)
Quantum fluctuations in inflation provide the seeds for the large scale distribution of matter today. According to the standard paradigm, these fluctuations induce density perturbations that are adiabatic and Gaussian distributed. In this limit, all the information is contained within the two-point correlation function, or equivalently, the power spectrum. Today, the distribution of matter is far from Gaussian, with structures forming across a vast range of scales. Despite this, almost all analyses of observational data are performed using two-point functions. This raises the question: what information lies in higher-point statistics?
In this seminar, I will present a pedagogical overview of the non-Gaussian correlation functions, and demonstrate how they can be used both to sharpen constraints on known physical parameters, and to provide stringent tests of new physics occurring in the early Universe. One of the major barriers to constraining cosmology from the higher-point functions is computational: measuring the statistics with conventional techniques is infeasible for current and future datasets. I will discuss new methods capable of reducing the computational cost by orders of magnitude, and show how this facilitates a number of exciting new tests of the cosmological model.
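For orientation, the standard definitions behind "2-point" versus "higher-point" (textbook conventions, not specific to this talk): writing \(\delta(\mathbf{k})\) for the Fourier-space density contrast,
\[
\langle \delta(\mathbf{k}_1)\,\delta(\mathbf{k}_2)\rangle
= (2\pi)^3\,\delta_D(\mathbf{k}_1+\mathbf{k}_2)\,P(k_1),
\qquad
\langle \delta(\mathbf{k}_1)\,\delta(\mathbf{k}_2)\,\delta(\mathbf{k}_3)\rangle
= (2\pi)^3\,\delta_D(\mathbf{k}_1+\mathbf{k}_2+\mathbf{k}_3)\,B(k_1,k_2,k_3),
\]
and for a Gaussian field the bispectrum \(B\) and all higher connected correlators vanish, which is why additional information only appears once gravitational evolution (or primordial non-Gaussianity) generates statistics beyond the power spectrum.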
Sharp Matrix Concentration
March Boedihardjo (University of California, Irvine)
Classical matrix concentration inequalities are sharp up to a logarithmic factor. This logarithmic factor is necessary in the commutative case but unnecessary in many classical noncommutative cases. We will present some matrix concentration results that are sharp in many cases, where we overcome this logarithmic factor by using an easily computable quantity that captures noncommutativity. Joint work with Afonso Bandeira and Ramon van Handel. Paper: https://arxiv.org/abs/2108.06312
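To make the logarithmic factor explicit (a standard statement, not the new result of the paper): for a Gaussian random matrix \(X=\sum_i g_i A_i\) with i.i.d. standard Gaussians \(g_i\) and deterministic symmetric \(d\times d\) matrices \(A_i\), the noncommutative Khintchine inequality gives
\[
\sigma(X) \;\lesssim\; \mathbb{E}\,\|X\| \;\lesssim\; \sigma(X)\sqrt{\log d},
\qquad
\sigma(X) = \Big\|\sum_i A_i^{2}\Big\|^{1/2},
\]
and the \(\sqrt{\log d}\) is unavoidable for diagonal (commutative) coefficients \(A_i\) but superfluous in many genuinely noncommutative examples, which is the gap the talk's sharper bounds address.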
Non-Parametric Convergence Rates for Plain Vanilla Stochastic Gradient Descent
Raphaël Berthier (École polytechnique fédérale de Lausanne)
Most theoretical guarantees for stochastic gradient descent (SGD) assume that the iterates are averaged, that the stepsizes are decreasing, and/or that the objective is regularized. However, practice shows that these tricks are less necessary than theoretically expected. I will present an analysis of SGD that uses none of these tricks: we analyze the behavior of the last iterate of fixed step-size, non-regularized SGD. Our results apply to kernel regression, i.e., infinite-dimensional linear regression. As a special case, we analyze an online algorithm for estimating a real function on the unit interval from observations of its value at randomly sampled points.
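A minimal sketch of the "plain vanilla" setting in the last sentence (my own choices of kernel, target, and step size; not the paper's exact assumptions): fixed step size, no averaging, no regularization, and only the last iterate is kept.

import numpy as np

rng = np.random.default_rng(0)

def f_star(x):
    return np.sin(2 * np.pi * x)                 # unknown target function on [0, 1]

def kernel(a, b):
    return np.exp(-np.abs(a - b) / 0.1)          # an illustrative kernel on the unit interval

T, gamma = 5000, 0.2                             # sample size, fixed step size
xs = rng.uniform(size=T)                         # randomly sampled points
ys = f_star(xs) + 0.1 * rng.normal(size=T)       # noisy observations of the function values
coef = np.zeros(T)                               # f_t = sum_s coef[s] * kernel(xs[s], .)

for t in range(T):
    pred = coef[:t] @ kernel(xs[:t], xs[t]) if t > 0 else 0.0
    coef[t] = -gamma * (pred - ys[t])            # one SGD step on the squared loss

grid = np.linspace(0, 1, 200)
f_hat = coef @ kernel(xs[:, None], grid[None, :])   # the last iterate, evaluated on a grid
print(np.mean((f_hat - f_star(grid)) ** 2))      # mean-squared error of the last iterate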
Quantum Scientific Computation
Jin-Peng Liu (University of New Mexico)
Quantum computers are expected to dramatically outperform classical computers for certain computational problems. While there has been extensive previous work on linear dynamics and discrete models, the capability of quantum computing for more complex, realistic problems arising in physical and social science, engineering, and medicine is far from well understood. One fundamental challenge is the substantial difference between the linear dynamics of a system of qubits and real-world systems with continuum, stochastic, and nonlinear behaviors. Utilizing advanced linear algebra techniques and nonlinear analysis, I attempt to build a bridge between classical and quantum mechanics, understand and optimize the power of quantum computation, and discover new quantum speedups over classical algorithms with provable guarantees. In this talk, I would like to cover quantum algorithms for scientific computational problems, including topics such as linear, nonlinear, and stochastic differential equations, with applications in areas such as quantum dynamics, biology and epidemiology, fluid dynamics, and finance.
References:
Quantum spectral methods for differential equations, Communications in Mathematical Physics 375, 1427-1457 (2020), https://arxiv.org/abs/1901.00961
High-precision quantum algorithms for partial differential equations, Quantum 5, 574 (2021), https://arxiv.org/abs/2002.07868
Efficient quantum algorithm for dissipative nonlinear differential equations, Proceedings of the National Academy of Sciences 118, 35 (2021), https://arxiv.org/abs/2011.03185
Quantum-accelerated multilevel Monte Carlo methods for stochastic differential equations in mathematical finance, Quantum 5, 481 (2021), https://arxiv.org/abs/2012.06283
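As one concrete problem setup behind the first two references above (schematic, in my notation): given a sparse linear ODE system
\[
\frac{dx(t)}{dt} = A(t)\,x(t) + b(t), \qquad x(0)=x_0 \in \mathbb{C}^{N},
\]
the quantum algorithms prepare a state \(|x(T)\rangle \propto \sum_{j=1}^{N} x_j(T)\,|j\rangle\) at cost scaling polylogarithmically in \(N\), under suitable sparsity, conditioning, and stability assumptions; the dissipative nonlinear case in the third reference proceeds by linearizing the dynamics into a system of this form via Carleman linearization.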
Self-Training Converts Weak Learners to Strong Learners in Mixture Models
Spencer Frei (UC Berkeley)
We consider a binary classification problem where the data come from a mixture of two rotationally symmetric log-concave distributions. We analyze the dynamics of a nonconvex gradient-based self-training algorithm that utilizes a pseudolabeler to generate labels for unlabeled data and updates the pseudolabeler based on a "supervised" loss that treats the pseudolabels as if they were ground-truth labels. We show that, provided the initial pseudolabeler has classification error smaller than some absolute constant, self-training produces classifiers with error at most epsilon greater than that of the Bayes-optimal classifier using O(d/epsilon^2) unlabeled examples. That is, self-training converts weak learners to strong learners using only unlabeled examples. We show that gradient descent on the logistic loss with O(d) labeled examples (i.e., independent of epsilon) suffices to produce a pseudolabeler for which the aforementioned results hold.
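A toy sketch of the procedure described above (illustrative constants, sample sizes, and a Gaussian mixture of my choosing; the paper's analysis is of course much finer): fit a weak pseudolabeler on a few labeled points, then repeatedly pseudolabel the unlabeled data and run gradient descent on the logistic loss as if the pseudolabels were ground truth.

import numpy as np

rng = np.random.default_rng(0)
d = 20
mu = np.zeros(d)
mu[0] = 1.5                                      # cluster mean; Bayes error = Phi(-1.5) ~ 0.067

def sample(n):
    y = rng.choice([-1, 1], size=n)
    return y[:, None] * mu + rng.normal(size=(n, d)), y

def logistic_gd(x, y, w, lr=0.1, steps=200):
    # plain gradient descent on the logistic loss, treating y as ground truth
    for _ in range(steps):
        margins = y * (x @ w)
        w = w + lr * (x * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    return w

x_lab, y_lab = sample(50)                        # a few labeled examples (O(d))
w = logistic_gd(x_lab, y_lab, np.zeros(d))       # weak initial pseudolabeler

x_unl, _ = sample(20000)                         # unlabeled data
for _ in range(5):                               # self-training rounds
    pseudo = np.sign(x_unl @ w)                  # pseudolabels from the current classifier
    w = logistic_gd(x_unl, pseudo, w)

x_te, y_te = sample(10000)
print(np.mean(np.sign(x_te @ w) != y_te))        # test error, to compare with ~0.067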