Enabling High-performance Sampling for Big Data Processing

Jun Wang
Professor of Computer Engineering and Director of the Computer Architecture and Storage Systems (CASS) Laboratory
University of Central Florida
SERC 306
Tuesday, October 22, 2019 - 11:00
In this talk, we demonstrate how to perform sampling on today’s big data processing platforms, enabling both efficient and accurate approximations on arbitrary sub-datasets of a large dataset. Because caching offline samples for every sub-dataset incurs prohibitive storage overhead, existing offline-sample-based systems provide high-accuracy results for only a limited number of sub-datasets, such as the most popular ones. Current online-sample-based approximation systems, on the other hand, generate samples at runtime but do not take into account the uneven storage distribution of a sub-dataset. They work well when a sub-dataset is uniformly distributed but suffer from low sampling efficiency and poor estimation accuracy on unevenly distributed sub-datasets.

To address this problem, we develop a distribution-aware method called Sapprox. Our idea is to collect the number of occurrences of a sub-dataset in each logical partition of a dataset (its storage distribution) in the distributed system and to use that information to guide online sampling. We have implemented Sapprox in the Hadoop ecosystem as an example system and open-sourced it on GitHub. Our comprehensive experimental results show that Sapprox can achieve a speedup of up to 20x over precise execution.
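
As a rough illustration of the distribution-aware idea, the sketch below samples partitions with probability proportional to how many records of the sub-dataset each one holds, then scales the sampled partition totals back up to an estimate for the whole sub-dataset. It is not taken from the Sapprox code base: the function names, the in-memory `occurrences` dictionary, and the use of a Hansen-Hurwitz estimator are all assumptions made for this example, and the actual system may sample and estimate differently.

```python
import random


def sample_partitions(occurrences, k, rng):
    """Draw k partition ids with replacement, weighted by occurrence count.

    `occurrences` maps partition id -> number of sub-dataset records stored
    there (the "storage distribution"). Partitions holding no records of the
    sub-dataset are never drawn, so no I/O is wasted on them.
    Hypothetical helper, not part of any real Sapprox or Hadoop API.
    """
    ids = [p for p, n in occurrences.items() if n > 0]
    total = sum(occurrences[p] for p in ids)
    probs = {p: occurrences[p] / total for p in ids}
    drawn = rng.choices(ids, weights=[probs[p] for p in ids], k=k)
    return drawn, probs


def estimate_sum(partition_sum, drawn, probs):
    """Hansen-Hurwitz estimate of a sub-dataset total.

    `partition_sum(p)` returns the exact aggregate over the sub-dataset's
    records in partition p; dividing by the draw probability corrects for
    the unequal chance of selecting each partition.
    """
    return sum(partition_sum(p) / probs[p] for p in drawn) / len(drawn)


# Made-up skewed layout: most records of the sub-dataset sit in partition 0.
occurrences = {0: 900, 1: 0, 2: 50, 3: 0, 4: 50}
rng = random.Random(0)
drawn, probs = sample_partitions(occurrences, k=3, rng=rng)

# Estimate the record count of the sub-dataset (each record contributes 1).
estimated_count = estimate_sum(lambda p: occurrences[p], drawn, probs)
print(drawn, estimated_count)
```

In this toy case the quantity being estimated is exactly proportional to the sampling weights, so the estimate matches the true count whichever partitions are drawn. For other aggregates the estimate varies from sample to sample, and its accuracy depends on how closely the per-partition totals track the occurrence counts.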