Misha Belkin (Ohio State)
Date and Time: Nov 14, 2018 (12:30 PM)
Location: Orchard room (3280) at the Wisconsin Institute for Discovery Building
A striking feature of modern supervised machine learning is its
consistent use of techniques that interpolate the data. Deep networks,
often containing several orders of magnitude more parameters than data
points, are trained to obtain near-zero error on the training set. Yet,
at odds with most theory, they show excellent test performance.
In this talk I will discuss, and give some historical context for, the
phenomenon of interpolation (zero training loss). I will show how it
provides a new perspective on machine learning, forcing us to rethink
some commonly held assumptions, and how it points to significant gaps in
our understanding, even in the simplest settings, of when classifiers
generalize. I will outline some first theoretical results in that
direction, showing that such classifiers can indeed be statistically
consistent and even optimal.
In the second part of the talk I will point to the computational power
of interpolation by describing how it results in very efficient
optimization of over-parametrized models using Stochastic Gradient
Descent. Furthermore, I will show how the simplicity of the setting can
be harnessed to construct very fast and theoretically sound methods for
training large-scale kernels.
I will also briefly describe some new accelerated SGD methods for