GS 4 Day 3 (9-April-2021)

4.1 ML foundations by Ravi Kumar, 9 AM

Ravi talks about a generic framework for online learning. A simple problem that can be modelled in this framework is the ski-rental problem, where renting skis costs $1 per day but buying them costs $25. The algorithm aims to give theoretical bounds on the maximum cost incurred before the correct decision is made. Ravi also emphasizes that online learning is pessimistic (worst-case), in contrast to machine learning, which tries to learn average behavior.
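
As a toy illustration (not from the talk), the classic break-even strategy for ski rental rents until the accumulated rent would reach the purchase price and then buys, which guarantees paying at most twice the offline optimum. A minimal Python sketch with the $1/$25 costs above; function names and structure are mine:

```python
def ski_rental_cost(num_ski_days: int, rent: float = 1.0, buy: float = 25.0) -> float:
    """Cost of the deterministic break-even strategy: rent until the total
    rent paid would reach the buy price, then buy."""
    cost = 0.0
    for _day in range(num_ski_days):
        if cost + rent >= buy:   # buying now is no worse than renting forever
            return cost + buy
        cost += rent             # keep renting
    return cost

def offline_optimum(num_ski_days: int, rent: float = 1.0, buy: float = 25.0) -> float:
    """Best possible cost when the number of ski days is known in advance."""
    return min(num_ski_days * rent, buy)

for days in (3, 25, 100):
    alg, opt = ski_rental_cost(days), offline_optimum(days)
    print(days, alg, opt, alg / opt)   # competitive ratio stays below 2
```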

The key idea is to incorporate ‘hints’ at each time step before making a decision. The research aims to get close to offline-level performance when the hints are good, while ensuring that in the worst case, when all the hints are bad, performance is no worse than using no hints at all.
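
One standard way to fold a hint into an online learner is an ‘optimistic’ update: act using the hint before the true loss arrives, then correct with the true gradient. The sketch below only illustrates that idea for linear losses over the unit ball and is not claimed to be the specific algorithm from the talk; all names and constants are mine:

```python
import numpy as np

def project_unit_ball(v):
    """Euclidean projection onto the unit ball."""
    n = np.linalg.norm(v)
    return v if n <= 1.0 else v / n

def optimistic_ogd(gradients, hints, dim, eta=0.1):
    """Online linear optimization over the unit ball with optimistic updates:
    play x_t using the hint, then update the base point with the true gradient."""
    y = np.zeros(dim)
    total_loss = 0.0
    for g, h in zip(gradients, hints):
        x = project_unit_ball(y - eta * h)   # act before seeing the true loss, using the hint
        total_loss += float(g @ x)           # incur the linear loss <g_t, x_t>
        y = project_unit_ball(y - eta * g)   # standard OGD bookkeeping with the true gradient
    return total_loss

rng = np.random.default_rng(0)
grads = [rng.standard_normal(5) for _ in range(200)]
good_hints = [g + 0.1 * rng.standard_normal(5) for g in grads]  # hints close to the true gradients
bad_hints = [rng.standard_normal(5) for _ in grads]             # uninformative hints

print("good hints:", optimistic_ogd(grads, good_hints, dim=5))
print("bad hints: ", optimistic_ogd(grads, bad_hints, dim=5))
```

With hints correlated with the true gradients the cumulative loss drops noticeably, while uninformative hints roughly recover plain online gradient descent, mirroring the good-hints/bad-hints trade-off described above.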

Ravi starts with a basic formulation of the problem and builds a narrative of how the research has evolved, with each stage fixing the issues of the previous solution. Worst-case cost bounds derived by various researchers are presented for each case. In the end, he covers the case where multiple hints are incorporated at a single time step.

A minimal set of experiments was performed on a ResNet classifier, where the hint is the previous gradient. I feel we could try modelling our online prediction models this way.

4.2 ML foundations: Stochastic Gradient Descent (theory vs. practice) by Prateek Jain, 10:30 AM

First, Prateek shows how different SGD is in practice from SGD in theory, in terms of convexity assumptions, non-i.i.d. selection of samples, batch size, step size, etc. The discussion then moves on to how fast gradient descent can find a \(p\)-th order stationary point. It turns out that for \(p \ge 4\), finding such stationary points is NP-hard. Somehow, gradient descent can still find local optima in practice, but the rates are unknown.
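
To make the theory/practice gap concrete, here is a small sketch (not from the talk; the toy least-squares problem and constants are mine) contrasting the i.i.d. with-replacement sampling assumed in most analyses with the shuffled-epoch, mini-batch SGD typically run in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star                                    # noiseless least-squares problem

def sgd_with_replacement(steps=2000, lr=0.01):
    """SGD as analysed in theory: each sample drawn i.i.d. with replacement."""
    x = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)
        x -= lr * (A[i] @ x - b[i]) * A[i]
    return x

def sgd_shuffled_minibatch(epochs=50, batch=20, lr=0.01):
    """SGD as run in practice: shuffle each epoch, sample without replacement in mini-batches."""
    x = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch):
            idx = perm[start:start + batch]
            grad = A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)
            x -= lr * grad
    return x

print(np.linalg.norm(sgd_with_replacement() - x_star))
print(np.linalg.norm(sgd_shuffled_minibatch() - x_star))
```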

Gradient descent can find first-order stationary points. Computing the Hessian is expensive, and so is finding second-order stationary points. However, there is a simple algorithm that finds second-order stationary points using gradients alone: noisy (perturbed) GD, where you add isotropic noise to the iterate and continue running GD. In practice, SGD is believed to behave like this and to find second-order stationary points with high probability; showing this concretely is an open problem.
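
A hedged sketch of that noisy/perturbed GD idea: whenever the gradient is small (a candidate stationary point) and we have not perturbed recently, add isotropic noise to the iterate and keep running GD. The test function, constants, and names below are illustrative only, not from the talk:

```python
import numpy as np

def f(w):
    # Strict saddle at the origin; local minima at (+1, 0) and (-1, 0).
    return (w[0] ** 2 - 1.0) ** 2 + w[1] ** 2

def grad_f(w):
    return np.array([4.0 * w[0] * (w[0] ** 2 - 1.0), 2.0 * w[1]])

def perturbed_gd(w0, lr=0.05, steps=400, noise_radius=0.1,
                 grad_tol=1e-3, t_gap=50, seed=0):
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    last_perturb = -t_gap
    for t in range(steps):
        g = grad_f(w)
        if np.linalg.norm(g) < grad_tol and t - last_perturb >= t_gap:
            # Near a first-order stationary point and not perturbed recently:
            # add isotropic noise to the iterate to escape a possible saddle.
            w = w + noise_radius * rng.standard_normal(w.shape)
            last_perturb = t
        else:
            w = w - lr * g                       # ordinary gradient step
    return w

w = perturbed_gd([0.0, 0.0])   # plain GD started here would stay stuck at the saddle
print(w, f(w))                 # ends near a local minimum (+-1, 0)
```

Started exactly at the saddle (0, 0), plain GD never moves, while the perturbed version escapes to a neighbourhood of one of the minima, which is the behaviour the gradient-only argument above relies on.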

Prateek concludes the talk by mentioning that if one analyses SGD as it is actually used in practice, it might be possible to overcome the current problems and bring the convergence rates down further.