Claude Shannon

Wednesday, March 23rd, 2015
Register Here

Room 202 in Packard Bldg., Stanford University
Parking Generally Free In Nearby Lots After 4:00 pm

Refreshments and Conversation at 6:00 P.M.
Presentation at 6:30 P.M.

When Your Big Data Seems Too Small:
Accurate inferences beyond the empirical distribution.

Prof. Gregory Valiant
Department of Computer Science, Stanford
     Co-sponsored with the IEEE Signal Processing Society Santa Clara Valley Chapter


We discuss two problems related to the general challenge of making accurate inferences about a complex distribution, in the regime in which the amount of data (i.e the sample size) is too small for the empirical distribution of the samples to be an accurate representation of the underlying distribution. The first problem is the basic task of inferring properties of a discrete distribution, given access to independent draws. We show that one can accurately recover the unlabelled vector of probabilities of all domain elements whose true probability is greater than 1/(n log n). Stated differently, one can learn--up to relabelling--the portion of the distribution consisting of elements with probability greater than 1/(n log n). This result has several curious implications, including leading to an optimal algorithm for "de-noising" the empirical distribution of the samples, and implying that one can accurately estimate the number of new domain elements that would be seen given a new larger sample, of size up to n* log n. (Extrapolation beyond this sample size is provable information theoretically impossible, without additional assumptions on the distribution.) While these results are applicable generally, we highlight an adaptation of this general approach to some problems in genomics (e.g. quantifying the number of unobserved protein coding variants).

The second problem we consider is the task of accurately estimating the eigenvalues of the covariance matrix of a (high dimensional real-valued) distribution--the "population spectrum". (These eigenvalues contain basic information about the distribution, including the presence or lack of low- dimensional structure in the distribution and the applicability of many higher- level machine learning and multivariate statistical tools.) As we show, even in the regime where the sample size is linear or sublinear in the dimensionality of the distribution, and hence the eigenvalues and eigenvectors of the empirical covariance matrix are misleading, accurate approximations to the true population spectrum are possible.

This talk is based on three papers, which are joint works with Paul Valiant,James Zou, and Weihao Kong.


Photo of Prof. Gregory Valiant Prof. Gregory Valiant is an Assistant Professor in Stanford's Computer Science Department. His main research interests are in algorithms, learning, applied probability, and statistics; he is also interested in evolution and game theory, and has enjoyed working on problems in database theory. Prior to joining Stanford, he was a postdoc at Microsoft Research, New England. He received his PhD from Berkeley in Computer Science, and BA in Math from Harvard.



SCV IT Society Webmaster (
Last updated on

Return to SCV IT Society Homepage