SILO



The weekly SILO Seminar Series is made possible through the generous support of the 3M Company and its Advanced Technology Group

3M

with additional support from the Analytics Group of the Northwestern Mutual Life Insurance Company

Northwestern Mutual

Similarity comparisons and non-metric clustering

Jordan Ellenberg, Professor of Mathematics

Date and Time: Jan 26, 2011 (12:30 PM)
Location: Orchard room (3280) at the Wisconsin Institute for Discovery Building

Abstract:

Suppose given a set S of objects, together with some measure of dissimilarity between pairs of objects in S. It is an old and much-studied problem to find good embeddings of S into standard metric spaces (like the real plane, or a binary tree) so that the dissimilarities between objects in S are roughly similar to the distances between the corresponding points in the metric space. If you have a real-valued dissimilarity between every pair of objects in S, and the target is Euclidean space, the standard approach is multi-dimensional scaling. In practice, one might have dissimilarity measurements only for some sparse set of pairs -- or (the non-metric case) one might not have real-valued dissimilarities at all, but only an ordinal ranking of similarities ("A is more like B than C is like D.")

I will discuss work of Lanckriet-McFee and Dasarathy-Eriksson-Novak-Singh on clustering given partial ordinal data on similarities; it turns out that in this case one can still do a good job of building a heirarchical clustering on S (that is, embedding S in the leaf-set of a tree.) We discuss some questions about how one might go further, including: can we use these methods to efficiently evade calibration problems ("one rater's "very similar" is another rater's "somewhat similar"?) How do we choose what kind of target metric space to try to embed S in? What is the non-metric version of the Netflix problem and how might one attack it? Is it correct to talk about "compressed clustering"?

Please be warned that the talk will contain no new theorems or data, and that the speaker is wandering dangerously far from his own area of expertise.