Sampling Massive Datasets in the Internet

Nicholas Duffield
Rutgers University
Wachman 1015D
Wednesday, February 19, 2014 - 11:00
Massive graph datasets are used operationally by providers of Internet, social network, and search services. Sampling can reduce storage requirements and query execution times while prolonging the useful life of the data for baselining and retrospective analysis. Sampling must mediate between the characteristics of the data, the available resources, and the accuracy needs of queries. This talk presents a cost-based formulation that expresses these opposing priorities and shows how it leads to optimal sampling schemes without prior statistical assumptions. The talk concludes with a discussion of open technical problems and potential applications of the methods beyond the Internet.