Why does RMSProp converge faster than Momentum in many cases?
Momentum:
$$v_{dw} := \beta v_{dw} + (1-\beta)\,dW$$ $$W := W - \alpha v_{dw}$$
RMSProp:
$$ S_{dw} := B \cdot S_{dw} + (1-B)\cdot (dW)^2$$ $$W := W- \alpha \frac{dW}{\sqrt{S_{dw}}}$$
where $\alpha$ is the learning rate (e.g. 0.01), $\beta$ is the momentum coefficient (e.g. 0.9), and $B$ is the analogous decay rate for RMSProp.
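For concreteness, here is a minimal NumPy sketch of the two update rules as I understand them. The function names, the small `eps` added to the denominator (to avoid division by zero, not part of the formulas above), and the example gradient are my own assumptions.

```python
import numpy as np

def momentum_step(W, v, dW, alpha=0.01, beta=0.9):
    """One Momentum update; returns the new (W, v)."""
    v = beta * v + (1 - beta) * dW   # moving average of the gradient
    W = W - alpha * v
    return W, v

def rmsprop_step(W, S, dW, alpha=0.01, B=0.9, eps=1e-8):
    """One RMSProp update; returns the new (W, S).

    eps is an assumed stabiliser, not in the formulas above.
    """
    S = B * S + (1 - B) * dW**2      # moving average of the squared gradient
    W = W - alpha * dW / (np.sqrt(S) + eps)
    return W, S

# Hypothetical gradient with very different magnitudes per coordinate.
W0 = np.array([1.0, 1.0])
dW = np.array([2.0, 0.02])

W_mom, v = momentum_step(W0, np.zeros(2), dW)
W_rms, S = rmsprop_step(W0, np.zeros(2), dW)
print("Momentum step:", W0 - W_mom)
print("RMSProp step: ", W0 - W_rms)
```

In this toy case the Momentum step is proportional to the gradient in each coordinate, while the RMSProp step comes out nearly the same size in both coordinates because each one is divided by its own $\sqrt{S_{dw}}$.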
As I see it, both Momentum and RMSProp have a "tendency to keep moving". I can see how RMSProp naturally accelerates on flat surfaces due to the factor
$$\frac{1}{\sqrt{S_{dw}}}$$
which becomes large when $S_{dw}$ is small. But is there another benefit that RMSProp provides?
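The flat-surface effect I describe can be checked numerically with a rough sketch like the one below. It assumes a constant, small gradient `g` as a stand-in for a flat region, and again adds a small `eps` to the denominator that is not in the formulas above.

```python
import numpy as np

alpha, beta, B, eps = 0.01, 0.9, 0.9, 1e-8
g = 1e-3  # constant small gradient: a hypothetical "flat" slope

w_mom, v = 0.0, 0.0
w_rms, S = 0.0, 0.0
for _ in range(100):
    # Momentum: the step is alpha * v, and v approaches g
    v = beta * v + (1 - beta) * g
    w_mom -= alpha * v
    # RMSProp: the step is alpha * g / sqrt(S), and sqrt(S) approaches |g|
    S = B * S + (1 - B) * g**2
    w_rms -= alpha * g / (np.sqrt(S) + eps)

mom_dist = abs(w_mom)
rms_dist = abs(w_rms)
print(f"Momentum moved {mom_dist:.6f}, RMSProp moved {rms_dist:.6f}")
```

Under these assumptions the Momentum step shrinks with the gradient (roughly $\alpha g$ per iteration), whereas the RMSProp step stays close to $\alpha$ per iteration regardless of how small $g$ is.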