Modern neural networks are often regarded as complex black-box functions whose behavior is difficult to understand owing to their nonlinear dependence on the data and the nonconvexity in their loss landscapes. In this work, we show that these common perceptions can be completely false in the early phase of learning. In particular, we formally prove that, for a class of well-behaved input distributions in high dimension, the early-time learning dynamics of a two-layer fully-connected neural network can be mimicked by training a simple linear model on the inputs. We additionally argue that this surprising simplicity can persist in networks with more layers and with convolutional architecture, which we verify empirically. Key to our analysis is to bound the spectral norm of the difference between the Neural Tangent Kernel (NTK) at initialization and an affine transform of the data kernel; however, unlike many previous results utilizing the NTK, we do not require the network to have disproportionately large width, and the network is allowed to escape the kernel regime later in training. Link to paper: https://arxiv.org/abs/2006.14599
Understanding deep learning calls for addressing the questions of: (i) optimization --- the effectiveness of simple gradient-based algorithms in solving neural network training programs that are non-convex and thus seemingly difficult; and (ii) generalization --- the phenomenon of deep learning models not overfitting despite having many more parameters than examples to learn from. Existing analyses of optimization and/or generalization typically adopt the language of classical learning theory, abstracting away many details on the setting at hand. In this talk I will argue that a more refined perspective is in order, one that accounts for the dynamics of the optimizer. I will then demonstrate a manifestation of this approach, analyzing the dynamics of gradient descent over linear neural networks. We will derive what is, to the best of my knowledge, the most general guarantee to date for efficient convergence to global minimum of a gradient-based algorithm training a deep network. Moreover, in stark contrast to conventional wisdom, we will see that sometimes, adding (redundant) linear layers to a classic linear model significantly accelerates gradient descent, despite the introduction of non-convexity. Finally, we will show that such addition of layers induces an implicit bias towards low rank (different from any type of norm regularization), and by this explain generalization of deep linear neural networks for the classic problem of low rank matrix completion. Works covered in this talk were in collaboration with Sanjeev Arora, Noah Golowich, Elad Hazan, Wei Hu, Yuping Luo and Noam Razin.
Generative adversarial networks (GANs) are a widely used framework for learning generative models. Wasserstein GANs (WGANs), one of the most successful variants of GANs, require solving a min-max optimization problem to global optimality but are in practice successfully trained using stochastic gradient descent-ascent. In this talk, we show that, when the generator is a one-layer network, stochastic gradient descent-ascent converges to a global solution with polynomial time and sample complexity.
Past five years have witnessed a sequence of successes in designing efficient algorithms for statistical estimation tasks when the input data is corrupted with a constant fraction of fully malicious outliers. The Sum-of-Squares (SoS) method has been an integral part of this story and is behind robust learning algorithms for tasks such as estimating the mean, covariance, and higher moment tensors of a broad class of distributions, clustering and parameter estimation for spherical and non-spherical mixture models, linear regression, and list-decodable learning.
In this talk, I will attempt to demystify this (unreasonable?) effectiveness of the SoS method in robust statistics. I will argue that the utility of the SoS algorithm in robust statistics can be directly attributed to its capacity (via low-degree SoS proofs) to "reason about" analytic properties of probability distributions such as sub-gaussianity, hypercontractivity, and anti-concentration. I will discuss precise formulations of such statements, show how they lead to a principled blueprint for problems in robust statistics including the applications mentioned above, and point out natural gaps in our understanding of analytic properties within SoS, which, if resolved would yield improved guarantees for basic tasks in robust statistics.
Random graphs with latent geometric structure are popular models of social and biological networks, with applications ranging from network user profiling to circuit design. These graphs are also of purely theoretical interest within computer science, probability and statistics. A fundamental initial question regarding these models is: when are these random graphs affected by their latent geometry and when are they indistinguishable from simpler models without latent structure, such as the Erdős-Rényi graph G(n,p)? We address this question for two of the most well-studied models of random graphs with latent geometry -- the random intersection and random geometric graph. Joint work with Matt Brennan and Dheeraj Nagaraj.
In high-dimensional statistical problems (including planted clique, sparse PCA, community detection, etc.), the class of "low-degree polynomial algorithms" captures many leading algorithmic paradigms such as spectral methods, approximate message passing, and local algorithms on sparse graphs. As such, lower bounds against low-degree algorithms constitute concrete evidence for average-case hardness of statistical problems. This method has been widely successful at explaining and predicting statistical-to-computational gaps in these settings. While prior work has understood the power of low-degree algorithms for problems with a "planted" signal, we consider here the setting of "random optimization problems" (with no planted signal), including the problem of finding a large independent set in a random graph, as well as the problem of optimizing the Hamiltonian of mean-field spin glass models. Focusing on the independent set problem, I will define low-degree algorithms in this setting, argue that they capture the best known algorithms, and explain new proof techniques that give sharp lower bounds against low-degree algorithms in this setting. The proof involves a generalization of the so-called "overlap gap property", which is a structural property of the solution space.
Based on arXiv:2004.12063 (joint with David Gamarnik and Aukosh Jagannath) and arXiv:2010.06563
In this talk I will shortly survey recent developments in approximate subgraph counting and sampling in sublinear-time. Both counting and sampling small subgraphs is a basic primitive, well studied both in theory and in practice. We consider these problems in the sublinear-time setting, where access to the graph $G$ is given via queries. We will consider both general graphs, and graphs of bounded arboricity which can be viewed as ``sparse everywhere" graphs, and we will see how we can use this property to obtain substantially faster algorithms.
We present lower-bounds for the generalization error of gradient descent on free initializations, reducing the problem to testing the algorithm’s output under different data models. We then discuss lower-bounds on random initialization and present the problem of learning communities in the pruned-block-model, where it is conjectured that GD fails.
Samples from high-dimensional distributions can be scarce or expensive. Can we meaningfully learn such distributions from one or just a few samples? We provide guarantees for single-sample estimation of Ising models, quantifying the estimation error in terms of the metric entropy of possible interaction matrices. As corollaries of our main result, we derive bounds when the model's interaction matrix is a (sparse) linear combination of known matrices, or it belongs to a finite set, or to a high-dimensional manifold.
Our result handles multiple independent samples by viewing them as one sample from a larger model, and can be used to derive estimation bounds that are qualitatively similar to state-of-the-art in the multiple-sample literature. We thus unify two separate strands of work in the literature: (1) a renaissance of recent work in Computer Science on estimating Ising models/MRFs from multiple independent samples under minimal assumptions about the model's interaction matrix; and (2) a growing literature in Probability Theory on estimating them from one sample in restrictive settings. On the technical front, we exploit novel concentration and anti-concentration inequalities for functions of the Ising model.
We reduce the problem of proving a "Boolean Unique Games Conjecture" (with gap 1-delta vs. 1-C*delta, for any C>1, and sufficiently small delta>0) to the problem of proving a PCP Theorem for a certain non-unique game. In a previous work, Khot and Moshkovitz suggested a candidate reduction (i.e., without a proof of soundness). The current work is the first to provide a reduction along with a proof of soundness. The non-unique game we reduce from is similar to non-unique games for which PCP theorems are known.
Our proof relies on a new, tight, local testing theorem for the low degree part of functions f:R^n-->R in Gaussian space: We show that the low degree part of the restriction of f to a random hyperplane h is extremely close (O(1/n) Euclidean distance) to the restriction of the low degree part of f to h with high probability.
This is joint work with Ronen Eldan from the Weizmann Institute.
Consider the following basic problem in sparse linear regression -- an algorithm gets labeled samples of the form (x, w.x + \eps) where w is an unknown n-dimensional vector, x is drawn from a background n-dimensional distribution D, and \eps is some independent noise. Given the promise that w is k-sparse, the breakthrough work of Candes, Romberg and Tao shows that w can be recovered with a number of samples that scales as O(k log n). This should be contrasted with general linear regression where O(n) samples are information theoretically necessary. We look at this problem from the vantage point of property testing, and study the complexity of determining whether the unknown vector w is k-sparse versus "far" from k-sparse. We show that this decision problem can be solved with a number of samples which is independent of n as long as the background distribution D is i.i.d. and its components are not Gaussian. We further show that weakening any of the conditions in this result necessarily makes the complexity scale logarithmically in n.
Joint work with Xue Chen (George Mason University) and Anindya De (University of Pennsylvania).
When can we test functions more efficiently than we can learn them? We will consider this fundamental problem in the standard PAC learning setting, which corresponds to the sample-based distribution-free property testing model. The VC dimension of a class of functions characterizes the number of samples needed to learn the class. But since some properties can indeed be tested more efficiently than they can be learned, VC dimension by itself does not characterize the number of samples needed to test it. Nonetheless, we will see that VC dimension combined with a closely-related “lower VC dimension" variant does give strong lower bounds on the number of samples required to test many properties of functions, including halfspaces, intersection of halfspaces, polynomial threshold functions, and decision trees.
This is joint work with Renato Ferreira Pinto Jr. and Nathan Harms.