What is the best we can do with the amount of data at our disposal with a given learning task? Modern learning problems---with a modest amount of data or subject to data processing constraints---frequently raise the need to understand the fundamental limits and make judicious use of the available small or imperfect data. This talk will cover several examples of learning where exploiting the key structure, as well as optimally trading between real-world resources, are vital to achieve statistical efficiency.

Current methods for causal discovery typically report a single directed acyclic graph (DAG). Through an example, I hope to convince you that this might not be the best practice. In fact, depending on how two DAGs intersect and the local geometry at the intersection, the hardness of this problem can vary dramatically.

Even the most carefully curated economic data sets have variables that are noisy, missing, discretized, or privatized. The standard workflow for empirical research involves data cleaning followed by data analysis that typically ignores the bias and variance consequences of data cleaning. We formulate a semiparametric model for causal inference with corrupted data to encompass both data cleaning and data analysis. We propose a new end-to-end procedure for data cleaning, estimation, and inference with data cleaning-adjusted confidence intervals. We prove consistency, Gaussian approximation, and semiparametric efficiency for our estimator of the causal parameter by finite sample arguments. The rate of Gaussian approximation is $n^{-1/2}$ for global parameters such as average treatment effect, and it degrades gracefully for local parameters such as heterogeneous treatment effect for a specific demographic. Our key assumption is that the true covariates are approximately low rank. In our analysis, we provide nonasymptotic theoretical contributions to matrix completion, statistical learning, and semiparametric statistics. We verify the coverage of the data cleaning-adjusted confidence intervals in simulations calibrated to resemble differential privacy as implemented in the 2020 US Census.

I will provide a brief overview of previous work on credible inference in the context of causal inference and statistical machine learning, and discuss ongoing directions on interfaces of causal inference embedded in complex operational systems.

Today's data-rich platforms are reshaping the operations of societal networks by providing information, recommendations, and matching services to a large number of users. How can we model the behavior of human agents in response to services provided by these platforms, and develop tools to improve the aggregate outcomes in a socially desirable manner? In this talk, I will briefly summarize our works that tackle this question from three aspects: 1) Game-theoretic analysis of the impact of information platforms (navigation apps) on the strategic behavior and learning processes of travelers in uncertain networks; 2) Market mechanism design for efficient carpooling and toll pricing in the presence of autonomous driving technology; 3) Security analysis and resource allocation for robustness under random or adversarial disruptions.

How do we make machine learning (ML) algorithms fair and reliable? This is particularly important today as ML enters high-stakes applications such as hiring and education, often adversely affecting people's lives with respect to gender, race, etc., and also violating anti-discrimination laws. When it comes to resolving legal disputes or even informing policies and interventions, only identifying bias/disparity in a model's decision is insufficient. We really need to dig deeper into how it arose. E.g., disparities in hiring that can be explained by an occupational necessity (code-writing skills for software engineering) may be exempt by law, but the disparity arising due to an aptitude test may not be (Ref: Griggs v. Duke Power `71). This leads us to a question that bridges the fields of fairness, explainability, and law: How can we identify and explain the sources of disparity in ML models, e.g., did the disparity entirely arise due to the critical occupational necessities? In this talk, I propose a systematic measure of "non-exempt disparity," i.e., the bias which cannot be explained by the occupational necessities. To arrive at a measure for the non-exempt disparity, I adopt a rigorous axiomatic approach that brings together concepts in information theory (in particular, an emerging body of work called Partial Information Decomposition) with causality.

Machine learning is increasingly being used for mechanism design, with applications such as price optimization on online marketplaces and ad auction design. In this talk, I will give an overview of my research on mechanism design via machine learning, touching on statistical problems such as overfitting, incentive problems, and privacy preservation.

Given i.i.d samples from an unknown distribution, estimating its symmetric properties is a classical problem in information theory, statistics and computer science. Symmetric properties are those that are invariant to label permutations and include popular functionals such as entropy and support size. Early work on this question dates back to the 1940s when R. A. Fisher and A. S. Corbet studied this to estimate the number of distinct butterfly species in Malaysia. Over the past decade, this question has received great attention leading to computationally efficient and sample optimal estimators for various symmetric properties. All these estimators were property specific and the design of a single estimator that is sample optimal for any symmetric property remained a central open problem in the area. In a recent breakthrough, Acharya et. al. showed that computing an approximate profile maximum likelihood (PML), a distribution that maximizes the likelihood of the observed multiset of frequencies, allows statistically optimal estimation of any symmetric property. However, since its introduction by Orlitsky et. al. in 2004, efficient computation of an approximate PML remained a well known open problem. In our work, we resolved this question by designing the first efficient algorithm for computing an approximate PML distribution. In addition, our investigations have led to a deeper understanding of various computational and statistical aspects of PML and universal estimators.

Motivated by recent advances in both theoretical and applied aspects of multiplayer games, spanning from e-sports to multi-agent generative adversarial networks, we focus on min-max optimization in team zero-sum games. In this class of games, players are split into two teams with payoffs equal within the same team and of opposite sign across the opponent team. Unlike the textbook two-player zero-sum games, finding a Nash equilibrium in our class can be shown to be CLS-hard, i.e., it is unlikely to have a polynomial-time algorithm for computing Nash equilibria. Moreover, in this generalized framework, we establish that even asymptotic last iterate or time average convergence to a Nash Equilibrium is not possible using Gradient Descent Ascent (GDA), its optimistic variant, and extra gradient. Specifically, we present a family of team games whose induced utility is \emph{non} multi-linear with \emph{non} attractive \emph{per-se} mixed Nash Equilibria, as strict saddle points of the underlying optimization landscape. Leveraging techniques from control theory, we complement these negative results by designing a modified GDA that converges locally to Nash equilibria. Finally, we discuss connections of our framework with AI architectures with team competition structures like multi-agent generative adversarial networks.

What will happen to Y if we do A? A variety of meaningful socio-economic and engineering questions can be formulated this way. To name a few: What will happen to a patient's health if they are given a new therapy? What will happen to a country's economy if policy-makers legislate a new tax? What will happen to a data center's latency if a new congestion control protocol is used? In this talk, we will explore how to answer such counterfactual questions using observational data---which is increasingly available due to digitization and pervasive sensors---and/or very limited experimental data. The two key challenges in doing so are: (i) counterfactual prediction in the presence of latent confounders; (ii) estimation with modern datasets which are high-dimensional, noisy, and sparse. Towards this goal, the key framework we introduce is connecting causal inference with tensor completion, a very active area of research across a variety of fields. In particular, we show how to represent the various potential outcomes (i.e., counterfactuals) of interest through an order-3 tensor. The key theoretical results presented are: (i) Formal identification results establishing under what missingness patterns, latent confounding, and structure on the tensor is recovery of unobserved potential outcomes possible. (ii) Introducing novel estimators to recover these unobserved potential outcomes and proving they are finite-sample consistent and asymptotically normal. The efficacy of the proposed estimators is shown on high-impact real-world applications. These include working with: (i) TaurRx Therapeutics to propose novel clinical trial designs to reduce the number of patients recruited for a trial and to correct for bias from patient dropouts; (ii) Uber Technologies on evaluating the impact of certain driver engagement policies without having to run an A/B test.

The ability to learn from data and make decisions in real-time has led to the rapid deployment of machine learning algorithms across many aspects of everyday life. While this has enabled new services and technologies, the fact that algorithms are increasingly interacting with people and other algorithms marks a distinct shift away from the traditional machine learning paradigm. Indeed, little is known about how these algorithms--- that were designed to operate in isolation--- behave when confronted with strategic behaviors on the part of people, and the extent to which strategic agents can game the algorithms to achieve better outcomes. In this talk, I will give an overview of my work on learning games and in the presence of strategic agents and multi-agent reinforcement learning.

Data-driven design is making headway into a number of application areas, including protein, small-molecule, and materials engineering. The design goal is to construct an object with desired properties, such as a protein that binds to a target more tightly than previously observed. To that end, costly experimental measurements are being replaced with calls to a high-capacity regression model trained on labeled data, which can be leveraged in an in silico search for promising design candidates. The aim then is to discover designs that are better than the best design in the observed data. This goal puts machine-learning based design in a much more difficult spot than traditional applications of predictive modelling, since successful design requires, by definition, some degree of extrapolation---a pushing of the predictive models to its unknown limits, in parts of the design space that are a priori unknown. In this talk, I will discuss our methodological approaches to this problem, as well as report on some recent success in designing gene therapy delivery (AAV) libraries, useful for general downstream directed evolution selections.