The weekly SILO Seminar Series is made possible through the generous support of the 3M Company and its Advanced Technology Group


Entity Matching Meets Data Science: A Progress Report from the Magellan Project

AnHai Doan

Date and Time: Feb 06, 2019 (12:30 PM)
Location: Orchard room (3280) at the Wisconsin Institute for Discovery Building


Entity matching (EM) finds data instances that refer to the
same real-world entity. In 2015, we started the Magellan project at UW-Madison to
build EM systems. Most current EM systems are stand-alone
monoliths. In contrast, Magellan borrows ideas from the field
of data science (DS) to build a new kind of EM system:
an ecosystem of interoperable tools. These tools often exploit
machine learning, user interaction, and big data scaling techniques.

This talk provides a progress report on the past 3.5 years
of Magellan, focusing on the system aspects and on how ideas
from the field of data science have been adapted to the EM
context. We begin by arguing why EM can be viewed as a
special class of DS problems, and thus can benefit from system
building ideas in DS. We discuss how these ideas have
been adapted to build PyMatcher and CloudMatcher, EM
tools for power users and lay users, respectively. These tools have been
successfully used in 22 EM tasks at 13 companies and domain
science groups, and have been pushed into production for
many customers. We report on the lessons learned and outline a newly
envisioned Magellan ecosystem, which consists of
not just on-premises Python tools but also interoperable microservices deployed,
executed, and scaled out on the cloud, using tools such as Docker and Kubernetes.