Toward a System Building Agenda for Data Integration (and Data Science)

AnHai Doan
Vilas Distinguished Achievement Professor of Computer Science
University of Wisconsin-Madison
SERC 306
Wednesday, September 12, 2018 - 11:00
Data integration (DI), broadly interpreted as covering all major data preparation steps such as data extraction, exploration, profiling, cleaning, matching, and merging, is a fundamental challenge in data science. In this talk, I argue that the DI community must devote far more effort to building systems in order to truly advance the field. I describe a system building agenda that we have been working on over the past three years at Wisconsin. I begin by focusing on entity matching (EM), a major challenge in DI. I describe how we develop cutting-edge EM solutions and implement them as software packages in the Python ecosystem of data science tools, as well as micro- and macro cloud services that data science teams can easily deploy. I also describe how we build self-service EM solutions that any lay user can easily use. I discuss the deployment of our EM systems at a Fortune 500 company and in many industrial and academic projects, along with the lessons learned. Finally, I discuss how we are applying the same system building ideas to attack problems in schema matching, data browsing, profiling, and cleaning. A key theme underlying many of our solutions is the use of machine learning and user interaction techniques, with a focus on scaling up these techniques to work over very large data sets as well as over both structured and text data.