52

It seems that the Adaptive Moment Estimation (Adam) optimizer nearly always works better than the alternatives (reaching a minimum of the cost function faster and more reliably) when training neural nets.

Why not always use Adam? Why even bother using RMSProp or momentum optimizers?
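For concreteness, here is the kind of choice I mean: a minimal sketch (assuming PyTorch; the model, data, and learning rates are just placeholders) in which the only line that differs between the methods is the optimizer.

```python
import torch
import torch.nn as nn

# Placeholder model and data; any network and dataset could stand in here.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

# The choice in question: pick one of these three lines.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

for step in range(1000):
    x, y = torch.randn(32, 20), torch.randn(32, 1)  # dummy batch
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```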

PyRsquared
  • I don't believe there is any strict, formalized way to support either statement. It's all purely empirical, as the error surface is unknown. As a rule of thumb, and purely from my experience, Adam does well where others fail (instance segmentation), although not without drawbacks (convergence is not monotone). – Alex May 08 '18 at 08:53
  • Adam is faster to converge. SGD is slower but generalizes better. So in the end it all depends on your particular circumstances. – agcala Mar 21 '19 at 12:10
  • https://en.wikipedia.org/wiki/No_free_lunch_theorem would seem relevant. Different optimization algorithms work better on different problems, and there is no universally superior one. – endolith Nov 28 '22 at 20:55

2 Answers

42

Here’s a blog post reviewing an article claiming that SGD generalizes better than Adam.

There is often value in using more than one method (an ensemble), because every method has a weakness.
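For illustration, here is a minimal sketch of that idea (assuming PyTorch; the toy model, data, and hyperparameters are placeholders): the same architecture trained several times with different optimizers, with the members' predictions averaged at inference time.

```python
import torch
import torch.nn as nn

def make_model():
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))

def train(model, optimizer, x, y, steps=500):
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    return model

def train_member(optimizer_cls, x, y, **opt_kwargs):
    # Build a fresh model and train it with the given optimizer.
    model = make_model()
    return train(model, optimizer_cls(model.parameters(), **opt_kwargs), x, y)

# Toy data standing in for a real training set.
x, y = torch.randn(256, 20), torch.randn(256, 1)

# Same architecture, different optimizers: each member fails in different ways.
members = [
    train_member(torch.optim.Adam, x, y, lr=1e-3),
    train_member(torch.optim.SGD, x, y, lr=1e-2, momentum=0.9),
    train_member(torch.optim.RMSprop, x, y, lr=1e-3),
]

def ensemble_predict(x_new):
    # Average the members' predictions.
    with torch.no_grad():
        return torch.stack([m(x_new) for m in members]).mean(dim=0)

print(ensemble_predict(torch.randn(4, 20)))
```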

Zephyr
19

You should also take a look at this post comparing different gradient descent optimizers. As you can see there, Adam is clearly not the best optimizer for some tasks, since several of the other methods converge better on them.
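For a rough sense of how such comparisons are run, here is a minimal sketch (assuming PyTorch; the test surface, starting point, and learning rates are arbitrary, untuned choices) that runs a few optimizers on the Rosenbrock function and reports where each one ends up.

```python
import torch

def rosenbrock(p):
    # Classic non-convex test surface; the global minimum is at (1, 1).
    x, y = p[0], p[1]
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

# Untuned, purely illustrative hyperparameters.
configs = {
    "SGD":          lambda p: torch.optim.SGD([p], lr=2e-4),
    "SGD+momentum": lambda p: torch.optim.SGD([p], lr=2e-4, momentum=0.9),
    "RMSprop":      lambda p: torch.optim.RMSprop([p], lr=1e-2),
    "Adam":         lambda p: torch.optim.Adam([p], lr=1e-2),
}

for name, make_opt in configs.items():
    p = torch.tensor([-1.5, 2.0], requires_grad=True)  # common starting point
    opt = make_opt(p)
    for _ in range(2000):
        opt.zero_grad()
        rosenbrock(p).backward()
        opt.step()
    final = [round(v, 3) for v in p.detach().tolist()]
    print(f"{name:>12}: final loss {rosenbrock(p).item():.4f} at {final}")
```

The final losses give only a crude, single-run comparison; which method comes out ahead shifts with the surface, the starting point, and the learning rates, which is essentially the point of the comparisons linked above.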