Lifecycle Management of Collaborative Data Science Workflows

Amol Deshpande
Professor of Computer Science
University of Maryland
SERC 306
Friday, December 1, 2017 - 11:00
For several decades now, the amount of data available to us has been growing at a pace far higher than our ability to process it; this trend has accelerated many-fold in recent years with the emergence of efficient and mass-produced scientific instruments, increasing ease of generating and publishing data, and proliferation of Internet-connected devices. In this talk I will present an overview of our ongoing work on building a platform for enabling collaborative data science, where teams of data scientists can simultaneously analyze, modify, and share datasets, to understand trends and to extract actionable insights. While numerous solutions exist for specific data analysis tasks, underlying infrastructure and data management capabilities for supporting ad hoc collaboration pipelines are still largely missing. I will present our vision for a unified, dataset-centric platform for addressing these challenges, and describe our recent work on: (a) efficiently managing a large number of versioned datasets, (b) designing and supporting a unified query language to seamlessly query versioning and provenance information, and (c) lifecycle management of complex machine learning models like deep neural networks.