The Data Civilizer – Easing the Pain of Data Scientists

Mourad Ouzzani
Principal Scientist with the Qatar Computing Research Institute
Hamad Bin Khalifa University, Qatar Foundation
SERC 306
Thursday, October 11, 2018 - 11:00
Enterprise data is usually scattered across departments and geographic regions, and it is often dirty and inconsistent. Data scientists spend most of their time finding, preparing, integrating, and cleaning relevant datasets. I will describe our current efforts to ease the pain of data scientists in our Data Civilizer project. Key components of Data Civilizer include data discovery, data cleaning, data transformation, and entity resolution and consolidation. I'll first briefly explain how we built a data discovery component through data profiling, indexing, and semantic elicitation. I'll then talk about the deep learning-based entity resolution component. Finally, I'll describe how to detect a special type of data error, namely disguised missing values, which turned out to be quite frequent in both proprietary and open datasets.