
Andrew Ng says in one of his slides (slide image not reproduced here) that gradient descent does not need a decreasing learning rate, because it automatically takes smaller steps as it approaches a minimum. However, there are numerous types of learning rate schedules in TensorFlow that change the learning rate as training progresses.

If it's true that these adjustments are redundant, then why do we need learning rate schedules?

Wong

2 Answers


On one hand, there is the practical fact that if the initial step size is a bit too optimistic, decay will at some point bring it down to a step size that works.

However, there is also a more theoretical reason: for a non-smooth function, decaying the step size is sometimes necessary for convergence at all.

As an example, consider $f(x) = |x|$, with an initial guess of $x = 2.5$ and a step size of $\alpha = 1$.

The gradient of this function is $1$ if $x > 0$ and $-1$ if $x < 0$.

Now apply GD: $x = x - \alpha \nabla_x f(x)$, which with $\alpha = 1$ becomes $x = x - \nabla_x f(x)$; since the gradient at $x = 2.5$ is $1$, the update is $x = x - 1$.

Therefore, starting from $2.5$, you go to $1.5$, then to $0.5$; at that point you jump to $-0.5$, and from there you jump back to $0.5$, and so on.

As you can see, you will never converge to the optimum $x = 0$ unless you decay your step size.
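
You can see this numerically with a minimal plain-Python sketch of the same experiment; the decay rule $\alpha_t = 1/(t+1)$ is just one illustrative choice:

```python
def grad(x):
    # (sub)gradient of f(x) = |x|
    return 1.0 if x > 0 else -1.0

# Fixed step size alpha = 1: oscillates between 0.5 and -0.5 forever
x = 2.5
for t in range(8):
    x -= 1.0 * grad(x)
    print(f"fixed alpha,    step {t + 1}: x = {x:+.3f}")

# Decaying step size alpha_t = 1 / (t + 1): the oscillation shrinks towards x = 0
x = 2.5
for t in range(8):
    x -= (1.0 / (t + 1)) * grad(x)
    print(f"decaying alpha, step {t + 1}: x = {x:+.3f}")
```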

Alberto

The keyword here is *can*. Deep learning models have millions of parameters, making the loss landscape incredibly complex. Although everything in the slide still applies -- GD will take smaller steps as you get closer to a local minimum, and you could theoretically converge using a fixed learning rate -- you almost always need extra tricks to train a neural network well.

For example, you might want to set the initial learning rate higher and decrease it as training progresses, because the loss landscape of a neural network has many local minima and you don't want to converge prematurely to a suboptimal one. (Although even this is a huge oversimplification of deep learning optimization.)
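
As a concrete illustration, here is a minimal sketch of what such a schedule looks like with Keras' built-in `ExponentialDecay`; the specific numbers are arbitrary, not a recommendation:

```python
import tensorflow as tf

# Start with a relatively large learning rate and decay it geometrically:
# after every 10,000 optimizer steps the learning rate is multiplied by 0.9.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=10_000,
    decay_rate=0.9,
)

# A schedule object can be passed wherever a fixed learning rate would go.
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)
```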

Contrast that with a much simpler model (e.g., a linear one). You won't see learning rate schedules used to train these models, because the loss landscape is very simple (in fact, for a 1-D linear model it looks exactly like the diagram in the slide). In this situation a learning rate schedule would probably be redundant.
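
To illustrate, plain gradient descent with a single fixed learning rate is perfectly adequate on a 1-D least-squares problem (a small sketch with synthetic data; the values are arbitrary):

```python
import numpy as np

# Synthetic 1-D linear regression: y = 3x + noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)

w, alpha = 0.0, 0.1                        # one fixed learning rate, no schedule
for _ in range(200):
    grad = np.mean(2 * (w * x - y) * x)    # d/dw of the mean squared error
    w -= alpha * grad

print(w)  # close to 3.0: the convex, bowl-shaped loss needs no schedule
```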

Alexander Wan