The weekly SILO Seminar Series is made possible through the generous support of the 3M Company and its Advanced Technology Group, with additional support from the Analytics Group of the Northwestern Mutual Life Insurance Company.

From Dirty Data to Structured Prediction

Theo Rekatsinas, UW–Madison

Date and Time: Feb 28, 2018 (12:30 PM)
Location: Orchard room (3280) at the Wisconsin Institute for Discovery Building


The advent of data-hungry applications has enabled computers to interpret what they see, communicate in natural language, and answer complex questions. There is a hidden catch, however: the reliance of all these state-of-the-art systems on high-effort tasks like data preparation and data cleaning. It is estimated that 70% to 80% of the time devoted to analytics projects is spent on checking and organizing data. The challenge is that data collection often introduces dirty data, i.e., incomplete, erroneous, replicated, or conflicting data records.

In this talk, I will discuss how to reason about dirty data and demonstrate how statistical learning is the key to managing large volumes of heterogeneous, noisy data sources effectively. I will present HoloClean, our new system that relies on statistical learning and inference to repair identified data errors and anomalies. Finally, I will conclude by drawing connections between data cleaning and structured prediction and showing how these connections lead to new insights and solutions to classical database problems such as data repairs and consistent query answering.