Talk page

Title:
Getting the most from our data

Speaker:
Paul Valiant

Abstract:
I will discuss techniques, structural results, and open problems from two of my recent papers, in the context of a broader area of work on the motivating question: "how do we get the most from our data?"

In the first part of the talk, I will revisit the basic statistical problem of estimating the mean of a real-valued distribution. I will introduce an estimator with the guarantee that "our estimator, on *any* distribution, is as accurate as the sample mean is for the Gaussian distribution of matching variance." Crucially, in contrast to prior works, our estimator does not require prior knowledge of the variance, and works across the entire gamut of distributions with finite variance, including those without any higher moments. Parameterized by the sample size n, the failure probability δ, and the variance σ², our estimator is accurate to within σ·(1+o(1))·√(2 log(1/δ)/n), which is tight up to the 1+o(1) factor.

In the second part of the talk I will introduce a framework for statistical estimation that leverages knowledge of how samples are collected but makes no distributional assumptions on the data values. As a motivating example: suppose one wishes to conduct an accurate political poll of a population, estimating the fraction of the population that would answer "yes" to a certain question; however, one does not have access to uniformly random people from the population, but instead only has access to a set of people S drawn from some distribution D, a distribution on subsets of the population. For example, D could describe a viral process on a social network, where respondents are the (highly correlated) set of people that the viral process has reached by a given time. How can knowledge of the distribution D, combined with the actual sampled set S and their survey responses, best be used to estimate the true population mean?
Our model encompasses the fact that people's survey responses may be intricately correlated with their probability of being in S. We present an efficient π/2 approximation to this general class of questions, via a surprising connection to the Grothendieck problem.
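To make the first part's guarantee concrete: the classical median-of-means estimator (not the speaker's estimator, which achieves the tight constant) is a simple baseline that already attains a sub-Gaussian-style error bound O(σ·√(log(1/δ)/n)) on any finite-variance distribution, without knowing σ. A minimal sketch, with the heavy-tailed Pareto distribution as a hypothetical test case:

```python
import math
import random
import statistics

def median_of_means(samples, delta):
    """Classical median-of-means mean estimator (illustrative baseline).

    Splits the samples into k ≈ log(1/delta) groups, averages each group,
    and returns the median of the group means. For any distribution with
    finite variance σ², this is within O(σ·√(log(1/δ)/n)) of the true mean
    with probability 1-δ, though with a worse constant than the tight
    σ·(1+o(1))·√(2 log(1/δ)/n) bound described in the talk.
    """
    k = max(1, min(len(samples), math.ceil(math.log(1.0 / delta))))
    group_size = len(samples) // k
    group_means = [
        statistics.fmean(samples[i * group_size:(i + 1) * group_size])
        for i in range(k)
    ]
    return statistics.median(group_means)

# Example: a Pareto(α=2.5) distribution has finite variance but no third
# moment; rare huge samples can skew the sample mean, while the median
# of group means remains stable.
random.seed(0)
samples = [random.paretovariate(2.5) for _ in range(10_000)]
estimate = median_of_means(samples, delta=0.01)  # true mean is α/(α-1) ≈ 1.667
```

The grouping trick is what removes the need for prior knowledge of the variance: each group mean is an unbiased estimate, and taking the median makes the result robust to the few groups contaminated by tail events.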

Link:
https://www.ias.edu/video/getting-most-our-data