5

The Gradient Descent with Momentum method for optimizing parameters is widely used. It has many variations including famous Adam algorithm. Now I do understand them, but where does the word "momentum" come from? I don't see any relation to the physical momentum: $p=mv$ or statistical momentum of order-$k$: $M_k = E(X^k)$.

Neo
  • 251
  • 1
  • 7
  • 2
    Maybe useful: "The name momentum stems from an analogy to momentum in physics: the weight vector $w$, thought of as a particle traveling through parameter space, incurs acceleration from the gradient of the loss ("force"). " – Mauro ALLEGRANZA Jun 30 '21 at 09:24
  • Have a look: https://ruder.io/optimizing-gradient-descent/index.html#momentum –  Jun 30 '21 at 09:31

1 Answers1

6

It is related to the physical momentum. In classic gradient descent the update $\Delta x$ is calculated as $-\alpha \nabla f(x)$ with learning rate $\alpha$ and hence, an update step reads $$ x' = x + \Delta x = x -\alpha \nabla f(x)$$

However, with momentum you remember the update of the last iteration $\Delta x$ and the update consists of a linear combination of the current gradient and the last update, i.e. $\Delta x' = \mu \Delta x - \alpha \nabla f(x')$:

$$ x'' = x' + \Delta x' = x' + \mu \Delta x - \alpha \nabla f(x')$$

The relation to physical momentum comes in by interpreting x as a heavy particle traveling through space and its new direction is a combination of old and new forces acting on it. So, it is more inert to sudden changes in the direction reducing oscillation. In other words, it's travel direction is more robust to sudden changes in the loss landscape. However, it is accelerating faster if the gradients point in the same direction, e.g. "going down hill".

saper0
  • 230
  • This is a good answer. However, can you be more specific as to why it's called momentum? Or a simple question: which term is the "momentum" ? – Neo Jun 30 '21 at 16:19
  • The momentum term is $\mu \Delta x$ as it represents the movement-direction of the "object" in the last time period which we want to "preserve" in a certain sense. I.e. the "object" wants to stay on its path and it needs a new force to push it into another direction. In our case, the "object" are the parameters which move into the opposite direction of their gradient term. – saper0 Jul 27 '22 at 12:27