Wednesday, March 23^{rd}, 2015
Register Here
Room 202 in Packard Bldg., Stanford University
Parking Generally Free In Nearby Lots After 4:00 pm
Map
Refreshments and Conversation at 6:00 P.M.
Presentation at 6:30 P.M.
When Your Big Data Seems Too Small: Accurate inferences beyond the empirical distribution.
Abstract
We discuss two problems related to the general challenge of making accurate
inferences about a complex distribution, in the regime in which the amount
of data (i.e the sample size) is too small for the empirical distribution of
the samples to be an accurate representation of the underlying distribution.
The first problem is the basic task of inferring properties of a discrete
distribution, given access to independent draws. We show that one can
accurately recover the unlabelled vector of probabilities of all domain
elements whose true probability is greater than 1/(n log n). Stated
differently, one can learnup to relabellingthe portion of the
distribution consisting of elements with probability greater than 1/(n log
n). This result has several curious implications, including leading to an
optimal algorithm for "denoising" the empirical distribution of the
samples, and implying that one can accurately estimate the number of new
domain elements that would be seen given a new larger sample, of size up to
n* log n. (Extrapolation beyond this sample size is provable information
theoretically impossible, without additional assumptions on the
distribution.) While these results are applicable generally, we highlight
an adaptation of this general approach to some problems in genomics (e.g.
quantifying the number of unobserved protein coding variants).
The second problem we consider is the task of accurately estimating the
eigenvalues of the covariance matrix of a (high dimensional realvalued)
distributionthe "population spectrum". (These eigenvalues contain basic
information about the distribution, including the presence or lack of low
dimensional structure in the distribution and the applicability of many higher
level machine learning and multivariate statistical tools.) As we show, even in
the regime where the sample size is linear or sublinear in the dimensionality of
the distribution, and hence the eigenvalues and eigenvectors of the empirical
covariance matrix are misleading, accurate approximations to the true population
spectrum are possible.
This talk is based on three papers, which are joint works with Paul Valiant,James Zou, and Weihao Kong.
