Why does RMSProp converge faster than Momentum?

Question

Why does RMSProp converge faster than Momentum in many cases?

Momentum:

$$v_{dw} := \beta v_{dw} + (1-\beta)\,dW$$ $$W := W - \alpha v_{dw}$$

RMSProp:

$$ S_{dw} := B \cdot S_{dw} + (1-B)\cdot (dW)^2$$ $$W := W- \alpha \frac{dW}{\sqrt{S_{dw}}}$$

Here $\alpha$ is the learning rate (e.g. 0.01) and $\beta$ is the momentum term (e.g. 0.9); $B$ plays a similar role in RMSProp.
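For concreteness, here is a minimal Python sketch of the two update rules above, written for plain lists of weights. The function names and the small `eps` in the denominator are my own assumptions (the formulas in the question have no `eps`; it is only there to avoid dividing by zero). Note that `(dW)^2` is an element-wise square, not a second derivative.

```python
import math

def momentum_step(w, v, grad, alpha=0.01, beta=0.9):
    """One Momentum update; returns the new weights and the new running average."""
    v = [beta * vi + (1 - beta) * gi for vi, gi in zip(v, grad)]
    w = [wi - alpha * vi for wi, vi in zip(w, v)]
    return w, v

def rmsprop_step(w, s, grad, alpha=0.01, B=0.9, eps=1e-8):
    """One RMSProp update; eps is an assumed safeguard, not part of the question's formula."""
    # Running average of the element-wise squared gradient.
    s = [B * si + (1 - B) * gi * gi for si, gi in zip(s, grad)]
    # Each coordinate is scaled by its own root-mean-square gradient.
    w = [wi - alpha * gi / (math.sqrt(si) + eps) for wi, gi, si in zip(w, grad, s)]
    return w, s
```

Both functions keep one state value per weight (`v` or `s`), initialised to zeros before the first step.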

From my point of view, both Momentum and RMSProp have a "tendency to keep moving". I can see how RMSProp will naturally accelerate on flat surfaces due to

$$\frac{1}{\sqrt{S_{dw}}}$$

when $S_{dw}$ is small, but is there another benefit that RMSprop provides?
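To make the flat-surface effect concrete, the sketch below feeds the same scalar gradient into both running averages many times and then compares the resulting step sizes. It is only an illustration under assumed settings: the helper name, `eps`, the step count and the two gradient values are mine, not part of the question.

```python
import math

def steady_state_steps(grad, alpha=0.01, beta=0.9, B=0.9, eps=1e-8, n_steps=200):
    """Apply a constant scalar gradient n_steps times; return (momentum step, RMSProp step)."""
    v = 0.0
    s = 0.0
    for _ in range(n_steps):
        v = beta * v + (1 - beta) * grad       # Momentum running average
        s = B * s + (1 - B) * grad * grad      # RMSProp running average of squares
    momentum_step = alpha * v
    rmsprop_step = alpha * grad / (math.sqrt(s) + eps)
    return momentum_step, rmsprop_step

flat = steady_state_steps(1e-4)   # nearly flat surface
steep = steady_state_steps(10.0)  # steep surface
```

With a constant gradient, $S_{dw}$ approaches $(dW)^2$, so the RMSProp step approaches $\alpha$ in both cases, while the Momentum step stays proportional to the gradient. So the same normalisation that speeds things up on flat regions also shrinks the step on steep ones.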

Green Falcon · Answer 1 · 2020-09-29T18:11:27.103

3

The basic intuition is that you should not use the same learning rate for every dimension. For instance, the loss can have a steep slope in one direction but not in another, so you should not move at the same speed in both directions. Momentum adds acceleration. Suppose the gradient is your instantaneous velocity and the running average is your average velocity. Momentum then acts like viscosity, or a kind of friction. Suppose you are near an optimum: your gradients approach zero and the average is low, which means your speed changes slowly. Both methods have an alpha term, but what is actually used is the running average, a kind of average that is cheap to compute. Take a look here and here for an analogy.
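A small sketch of the "different speed per dimension" point, using a made-up quadratic loss that is steep in one coordinate and shallow in the other. The loss, the starting point and `eps` are assumptions chosen only for illustration.

```python
import math

# Hypothetical loss f(w) = 0.5 * (100 * w[0]**2 + w[1]**2): steep in w[0], shallow in w[1].
def grad(w):
    return [100.0 * w[0], 1.0 * w[1]]

alpha, B, eps = 0.01, 0.9, 1e-8
w = [1.0, 1.0]
g = grad(w)

# Plain gradient step: the same alpha multiplies very different gradients.
sgd_step = [alpha * gi for gi in g]

# First RMSProp step, with the running average S starting at zero.
s = [(1 - B) * gi * gi for gi in g]
rmsprop_step = [alpha * gi / (math.sqrt(si) + eps) for gi, si in zip(g, s)]

sgd_ratio = sgd_step[0] / sgd_step[1]              # about 100: lopsided update
rmsprop_ratio = rmsprop_step[0] / rmsprop_step[1]  # about 1: balanced update
```

The plain step is 100 times larger along the steep coordinate, whereas RMSProp divides each coordinate by its own gradient scale and moves both by about the same amount.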

edited Sep 29 '20 at 18:11

answered Apr 21 '18 at 15:38

Green Falcon


I don't think this is a proper analogy. The gradient is already an instantaneous rate of change, i.e. acceleration. Momentum adds velocity and mass (beta) so that acceleration can accumulate. – Austin Apr 03 '20 at 01:53
@Austin basically the first-order derivative, gradient here, corresponds to velocity and second-order derivative, Hessian matrix, corresponds to acceleration. Did you see the links provided above? – Green Falcon Apr 03 '20 at 04:00
Your analogy is backwards. Ng literally says, "... then these derivative terms you can think of as providing acceleration...", and "and these momentum terms you can think of as representing velocity". I think you're confusing the idea that momentum ADDs acceleration with the idea that momentum IS acceleration. Momentum is mass*velocity. Vanilla SGD uses ONLY acceleration, but it has no velocity, and therefore no momentum. – Austin Apr 04 '20 at 15:50
and with momentum accumulation, the terminal velocity is proportional to 1/(1-beta). – Austin Apr 04 '20 at 15:58
@Austin this file may help you. – Green Falcon Apr 04 '20 at 17:01
Where are you seeing a Hessian in any of these equations? The dw^2 is not a second derivative, it's an element-wise product. I think this will help you. – Austin Apr 04 '20 at 17:55
I've provided that to show you the intuition of the second derivative. If you think my answer is incorrect, I'm open to criticism! Please go ahead and post your own answer. It may help everyone. – Green Falcon Apr 04 '20 at 18:23
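On the terminal-velocity remark in the comments: whether the velocity settles at the gradient or at gradient / (1 - beta) depends on which form of momentum is used. The sketch below applies a constant gradient to both forms; the helper name and the `averaged` flag are hypothetical, introduced only for this illustration.

```python
def terminal_velocity(grad, beta, averaged, n_steps=500):
    """Apply a constant gradient n_steps times and return the final velocity."""
    v = 0.0
    for _ in range(n_steps):
        if averaged:
            # Form used in the question: exponentially weighted average.
            v = beta * v + (1 - beta) * grad
        else:
            # Accumulating form without the (1 - beta) factor.
            v = beta * v + grad
    return v

v_avg = terminal_velocity(1.0, 0.9, averaged=True)   # settles near grad
v_sum = terminal_velocity(1.0, 0.9, averaged=False)  # settles near grad / (1 - beta)
```

With the averaged form from the question, the factor 1/(1 - beta) is effectively folded into the learning rate, so the two forms differ only by a rescaling of $\alpha$.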

Varun Bajpai · Answer 2 · 2020-02-02T18:15:56.740

Momentum is linear and provides speed to the update

RMSprop contributes the exponentially decaying average of past "squared gradients"

In RMSProp, by using this average we try to diminish the vertical movement: directions whose gradients oscillate strongly accumulate a large average of squared gradients and therefore take smaller steps.

RMSProp provides an average to the update.

Adam combines RMSProp and Momentum, i.e. the speed and the average of the update together. On average it speeds up the directions in which more update is needed.
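As a rough sketch of how Adam puts the two together, here is one update step for a plain list of weights. The function name is my own; the default values and the bias-correction terms follow the commonly cited Adam formulation and are not spelled out in this answer.

```python
import math

def adam_step(w, m, s, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1); returns new weights and both averages."""
    # Momentum part: running average of the gradient.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    # RMSProp part: running average of the element-wise squared gradient.
    s = [beta2 * si + (1 - beta2) * gi * gi for si, gi in zip(s, grad)]
    # Bias correction compensates for both averages starting at zero.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    s_hat = [si / (1 - beta2 ** t) for si in s]
    w = [wi - alpha * mh / (math.sqrt(sh) + eps) for wi, mh, sh in zip(w, m_hat, s_hat)]
    return w, m, s
```

The numerator is the Momentum average and the denominator is the RMSProp scaling, which is the combination described above.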

All three are faster than Stochastic Gradient Descent without an exponentially weighted average. In the worst case use Momentum; don't go for plain weight updates.
