Computer Science


The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks in High Dimension

Wei Hu (Princeton University)
December 16, 2020

16877
Computer Science
Modern neural networks are often regarded as complex black-box functions whose behavior is difficult to understand owing to their nonlinear dependence on the data and the nonconvexity in their loss landscapes. In this work, we show that these common perceptions can be completely false in the early phase of learning. In particular, we formally prove that, for a class of well-behaved input distributions in high dimension, the early-time learning dynamics of a two-layer fully-connected neural network can be mimicked by training a simple linear model on the inputs. We additionally argue that this surprising simplicity can persist in networks with more layers and with convolutional architecture, which we verify empirically. Key to our analysis is to bound the spectral norm of the difference between the Neural Tangent Kernel (NTK) at initialization and an affine transform of the data kernel; however, unlike many previous results utilizing the NTK, we do not require the network to have disproportionately large width, and the network is allowed to escape the kernel regime later in training. Link to paper: https://arxiv.org/abs/2006.14599
Analyzing Optimization and Generalization in Deep Learning via Dynamics of Gradient Descent

Nadav Cohen (Tel-Aviv University)
December 16, 2020

16875
Computer Science
Understanding deep learning calls for addressing the questions of: (i) optimization --- the effectiveness of simple gradient-based algorithms in solving neural network training programs that are non-convex and thus seemingly difficult; and (ii) generalization --- the phenomenon of deep learning models not overfitting despite having many more parameters than examples to learn from. Existing analyses of optimization and/or generalization typically adopt the language of classical learning theory, abstracting away many details of the setting at hand. In this talk I will argue that a more refined perspective is in order, one that accounts for the dynamics of the optimizer. I will then demonstrate a manifestation of this approach, analyzing the dynamics of gradient descent over linear neural networks. We will derive what is, to the best of my knowledge, the most general guarantee to date for efficient convergence to a global minimum of a gradient-based algorithm training a deep network. Moreover, in stark contrast to conventional wisdom, we will see that sometimes, adding (redundant) linear layers to a classic linear model significantly accelerates gradient descent, despite the introduction of non-convexity. Finally, we will show that such addition of layers induces an implicit bias towards low rank (different from any type of norm regularization), and thereby explain the generalization of deep linear neural networks for the classic problem of low-rank matrix completion. Works covered in this talk were in collaboration with Sanjeev Arora, Noah Golowich, Elad Hazan, Wei Hu, Yuping Luo and Noam Razin.
SGD Learns One-Layer Networks in WGANs

Qi Lei (Princeton University)
December 16, 2020

16874
Computer Science
Generative adversarial networks (GANs) are a widely used framework for learning generative models. Wasserstein GANs (WGANs), one of the most successful variants of GANs, require solving a min-max optimization problem to global optimality but are in practice successfully trained using stochastic gradient descent-ascent. In this talk, we show that, when the generator is a one-layer network, stochastic gradient descent-ascent converges to a global solution with polynomial time and sample complexity.
Recent Progress in Algorithmic Robust Statistics via the Sum-of-Squares Method

Pravesh Kothari (CMU)
December 15, 2020

16921
Computer Science
The past five years have witnessed a sequence of successes in designing efficient algorithms for statistical estimation tasks when the input data is corrupted with a constant fraction of fully malicious outliers. The Sum-of-Squares (SoS) method has been an integral part of this story and is behind robust learning algorithms for tasks such as estimating the mean, covariance, and higher moment tensors of a broad class of distributions, clustering and parameter estimation for spherical and non-spherical mixture models, linear regression, and list-decodable learning. In this talk, I will attempt to demystify this (unreasonable?) effectiveness of the SoS method in robust statistics. I will argue that the utility of the SoS algorithm in robust statistics can be directly attributed to its capacity (via low-degree SoS proofs) to "reason about" analytic properties of probability distributions such as sub-gaussianity, hypercontractivity, and anti-concentration. I will discuss precise formulations of such statements, show how they lead to a principled blueprint for problems in robust statistics including the applications mentioned above, and point out natural gaps in our understanding of analytic properties within SoS, which, if resolved, would yield improved guarantees for basic tasks in robust statistics.
Phase Transitions for Detecting Latent Geometry in Random Graphs

Dheeraj Nagaraj (MIT)
December 15, 2020

16873
Computer Science
Random graphs with latent geometric structure are popular models of social and biological networks, with applications ranging from network user profiling to circuit design. These graphs are also of purely theoretical interest within computer science, probability and statistics. A fundamental initial question regarding these models is: when are these random graphs affected by their latent geometry and when are they indistinguishable from simpler models without latent structure, such as the Erdős-Rényi graph G(n,p)? We address this question for two of the most well-studied models of random graphs with latent geometry -- the random intersection graph and the random geometric graph. Joint work with Matt Brennan and Guy Bresler.
Low-Degree Hardness of Random Optimization Problems

Alex Wein (New York University)
December 15, 2020

16866
Computer Science
In high-dimensional statistical problems (including planted clique, sparse PCA, community detection, etc.), the class of "low-degree polynomial algorithms" captures many leading algorithmic paradigms such as spectral methods, approximate message passing, and local algorithms on sparse graphs. As such, lower bounds against low-degree algorithms constitute concrete evidence for average-case hardness of statistical problems. This method has been widely successful at explaining and predicting statistical-to-computational gaps in these settings. While prior work has understood the power of low-degree algorithms for problems with a "planted" signal, we consider here the setting of "random optimization problems" (with no planted signal), including the problem of finding a large independent set in a random graph, as well as the problem of optimizing the Hamiltonian of mean-field spin glass models. Focusing on the independent set problem, I will define low-degree algorithms in this setting, argue that they capture the best known algorithms, and explain new proof techniques that give sharp lower bounds against low-degree algorithms in this setting. The proof involves a generalization of the so-called "overlap gap property", which is a structural property of the solution space. Based on arXiv:2004.12063 (joint with David Gamarnik and Aukosh Jagannath) and arXiv:2010.06563
Counting and Sampling Subgraphs in Sublinear Time

Talya Eden (MIT)
December 15, 2020

16864
Computer Science
In this talk I will briefly survey recent developments in approximate subgraph counting and sampling in sublinear time. Counting and sampling small subgraphs are basic primitives, well studied both in theory and in practice. We consider these problems in the sublinear-time setting, where access to the graph $G$ is given via queries. We will consider both general graphs and graphs of bounded arboricity, which can be viewed as "sparse everywhere" graphs, and we will see how we can use this property to obtain substantially faster algorithms.
Learning and Testing for Gradient Descent

Emmanuel Abbe (EPFL)
December 15, 2020

16863
Computer Science
We present lower bounds for the generalization error of gradient descent on free initializations, reducing the problem to testing the algorithm’s output under different data models. We then discuss lower bounds on random initialization and present the problem of learning communities in the pruned-block-model, where it is conjectured that GD fails.
Learning Ising Models from One, Ten or a Thousand Samples

Costis Daskalakis (MIT)
December 14, 2020

16942
Computer Science
Samples from high-dimensional distributions can be scarce or expensive. Can we meaningfully learn such distributions from one or just a few samples? We provide guarantees for single-sample estimation of Ising models, quantifying the estimation error in terms of the metric entropy of possible interaction matrices. As corollaries of our main result, we derive bounds when the model's interaction matrix is a (sparse) linear combination of known matrices, or it belongs to a finite set, or to a high-dimensional manifold. Our result handles multiple independent samples by viewing them as one sample from a larger model, and can be used to derive estimation bounds that are qualitatively similar to state-of-the-art in the multiple-sample literature. We thus unify two separate strands of work in the literature: (1) a renaissance of recent work in Computer Science on estimating Ising models/MRFs from multiple independent samples under minimal assumptions about the model's interaction matrix; and (2) a growing literature in Probability Theory on estimating them from one sample in restrictive settings. On the technical front, we exploit novel concentration and anti-concentration inequalities for functions of the Ising model.
Reduction From Non-Unique Games to Boolean Unique Games

Dana Moshkovitz (University of Texas, Austin)
December 14, 2020

16862
Computer Science
We reduce the problem of proving a "Boolean Unique Games Conjecture" (with gap 1-delta vs. 1-C*delta, for any C>1, and sufficiently small delta>0) to the problem of proving a PCP Theorem for a certain non-unique game. In a previous work, Khot and Moshkovitz suggested a candidate reduction (i.e., without a proof of soundness). The current work is the first to provide a reduction along with a proof of soundness. The non-unique game we reduce from is similar to non-unique games for which PCP theorems are known. Our proof relies on a new, tight, local testing theorem for the low degree part of functions f:R^n-->R in Gaussian space: We show that the low degree part of the restriction of f to a random hyperplane h is extremely close (O(1/n) Euclidean distance) to the restriction of the low degree part of f to h with high probability. This is joint work with Ronen Eldan from the Weizmann Institute.
Testing Noisy Linear Functions for Sparsity

Rocco Servedio (Columbia University)
December 14, 2020

16860
Computer Science
Consider the following basic problem in sparse linear regression -- an algorithm gets labeled samples of the form (x, w·x + ε) where w is an unknown n-dimensional vector, x is drawn from a background n-dimensional distribution D, and ε is some independent noise. Given the promise that w is k-sparse, the breakthrough work of Candes, Romberg and Tao shows that w can be recovered with a number of samples that scales as O(k log n). This should be contrasted with general linear regression, where Ω(n) samples are information-theoretically necessary. We look at this problem from the vantage point of property testing, and study the complexity of determining whether the unknown vector w is k-sparse versus "far" from k-sparse. We show that this decision problem can be solved with a number of samples which is independent of n as long as the background distribution D is i.i.d. and its components are not Gaussian. We further show that weakening any of the conditions in this result necessarily makes the complexity scale logarithmically in n. Joint work with Xue Chen (George Mason University) and Anindya De (University of Pennsylvania).
VC Dimension and Distribution-Free Sample-Based Testing

Eric Blais (University of Waterloo)
December 14, 2020

16858
Computer Science
When can we test functions more efficiently than we can learn them? We will consider this fundamental problem in the standard PAC learning setting, which corresponds to the sample-based distribution-free property testing model. The VC dimension of a class of functions characterizes the number of samples needed to learn the class. But since some properties can indeed be tested more efficiently than they can be learned, VC dimension by itself does not characterize the number of samples needed to test it. Nonetheless, we will see that VC dimension combined with a closely-related "lower VC dimension" variant does give strong lower bounds on the number of samples required to test many properties of functions, including halfspaces, intersection of halfspaces, polynomial threshold functions, and decision trees. This is joint work with Renato Ferreira Pinto Jr. and Nathan Harms.

