SGD is able to jump out of local minima that would otherwise trap BGD
I don't really understand the above statement. Could someone please provide a mathematical explanation for why SGD (Stochastic Gradient Descent) is able to escape local minima, while BGD (Batch Gradient Descent) can't?
P.S.
While searching online, I read that it has something to do with "oscillations" in the steps SGD takes towards the global minimum. What does that mean?
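While searching, I also came across the idea of modelling SGD's update as the full (batch) gradient plus zero-mean noise. Here is a toy simulation I wrote to illustrate what I think the claim means; the function, starting point, learning rate, and noise scale are my own arbitrary choices:

```python
import numpy as np

# Toy 1-D loss with a shallow local minimum (near x ~ 1.93)
# and a deeper global minimum (near x ~ -2.06).
def f(x):
    return x**4 - 8 * x**2 + 2 * x

def grad(x):
    return 4 * x**3 - 16 * x + 2

lr, steps = 0.01, 5000
x_bgd = x_sgd = 2.0  # start inside the basin of the local minimum

rng = np.random.default_rng(0)
for _ in range(steps):
    # BGD: exact gradient -> deterministic, stays trapped in the local basin.
    x_bgd -= lr * grad(x_bgd)
    # SGD (modelled here as exact gradient + zero-mean noise):
    # the noise can occasionally kick x over the barrier between basins.
    x_sgd -= lr * (grad(x_sgd) + rng.normal(scale=50.0))

print(f"BGD ends at x = {x_bgd:.3f}, f = {f(x_bgd):.2f}")  # the local minimum
print(f"SGD ends at x = {x_sgd:.3f}, f = {f(x_sgd):.2f}")  # may have escaped, depending on the noise draw
```

Is this "noise kicks you out of the basin" picture the right way to think about it, and is that what the "oscillations" refer to?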