I am reviewing some course material in which the lecturer suggests that, instead of guessing the learning-rate parameter in a gradient descent implementation, one could use the inverse of the Hessian multiplied by the negative of the Jacobian (i.e., the gradient, for a scalar objective) to determine the step.
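To make sure I am reading this correctly, I believe the update being described is the Newton step

$$\theta_{t+1} = \theta_t - H^{-1}\,\nabla f(\theta_t),$$

where $f$ is the objective and $H$ is its Hessian at $\theta_t$, as opposed to the plain gradient-descent update $\theta_{t+1} = \theta_t - \alpha\,\nabla f(\theta_t)$ with a hand-chosen learning rate $\alpha$. (The notation here is mine, not the lecturer's.)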
Any help with the intuition behind using the inverse of the Hessian would be much appreciated.
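For concreteness, here is a minimal sketch of what I understand the suggestion to be, run on a toy quadratic. The objective, the matrix `A`, and all names are my own, purely for illustration:

```python
import numpy as np

# Toy quadratic objective f(x) = 0.5 * x^T A x - b^T x,
# whose gradient is A x - b and whose Hessian is the constant matrix A.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])  # symmetric positive definite
b = np.array([1.0, -2.0])

def grad(x):
    return A @ x - b

def hess(x):
    return A

x_gd = np.zeros(2)
x_newton = np.zeros(2)
lr = 0.1  # guessed learning rate for plain gradient descent

for _ in range(50):
    # Plain gradient descent: fixed, hand-tuned step size.
    x_gd = x_gd - lr * grad(x_gd)

    # Newton step: solve H d = -g rather than forming the inverse explicitly.
    d = np.linalg.solve(hess(x_newton), -grad(x_newton))
    x_newton = x_newton + d

print("gradient descent:", x_gd)
print("Newton's method: ", x_newton)
print("exact minimiser: ", np.linalg.solve(A, b))
```

On this quadratic the Hessian is constant, so the Newton update lands on the minimiser in a single step, while gradient descent with my guessed learning rate needs many iterations. It is exactly this behaviour I would like to understand intuitively: why does scaling the negative gradient by the inverse Hessian produce such a well-sized step?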