Second-order optimization algorithms, such as Newton-style methods that use the Hessian, have more information about the curvature of the loss function, so they can converge in far fewer iterations than first-order algorithms like gradient descent. I remember reading somewhere that if you have $n$ weights in the neural network, one iteration of a second-order optimization algorithm reduces the loss at roughly the same rate as $n$ iterations of a standard first-order algorithm. However, with more recent improvements to gradient descent (momentum, adaptive learning rates, etc.), the difference isn't as large anymore -- @EmmanuelMess pointed out a paper that states:
> The performance of the proposed first order and second order methods with adaptive gain (BP-AG, CGFR-AG, BFGS-AG) with standard second order methods without gain (BP, CGFR, BFGS) in terms of speed of convergence evaluated in the number of epochs and CPU time. Based on some simulation results, it’s showed that the proposed algorithm had shown improvements in the convergence rate with 40% faster than other standard algorithms without losing their accuracy.
Here is a great post explaining the math behind why this is the case.
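To make the convergence gap concrete, here is a minimal NumPy sketch (the quadratic objective, its conditioning, and the step counts are arbitrary illustrative choices of mine, not from the post or the paper): on a quadratic loss a single Newton step $w \leftarrow w - H^{-1}\nabla L(w)$ lands on the minimum, while plain gradient descent needs many iterations when the Hessian is ill-conditioned.

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T A w with an ill-conditioned Hessian A.
# (A, the starting point and the learning rate are illustrative choices.)
rng = np.random.default_rng(0)
n = 50
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
A = Q @ np.diag(np.logspace(0, 3, n)) @ Q.T   # eigenvalues spread from 1 to 1000

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

w0 = rng.normal(size=n)

# One Newton step: w <- w - H^{-1} grad. For a quadratic this jumps straight to the minimum.
w_newton = w0 - np.linalg.solve(A, grad(w0))

# Many gradient-descent steps, with the step size limited by the largest eigenvalue.
w_gd = w0.copy()
lr = 1.0 / 1000
for _ in range(1000):
    w_gd = w_gd - lr * grad(w_gd)

print("loss after 1 Newton step:", loss(w_newton))   # essentially 0 (machine precision)
print("loss after 1000 GD steps:", loss(w_gd))        # still orders of magnitude larger
```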
Also, second-order information can help the optimizer identify states like saddle points and get out of them. Saddle points give standard gradient descent a lot of trouble: near a saddle point the gradient is tiny, so gradient descent moves out of it very slowly. Fixing the saddle-point issue has been one of the motivations for improving gradient descent over the last two decades (SGD with momentum, adaptive learning rates, Adam, etc.). More info.
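As a tiny illustration of the saddle-point issue (the function $f(x, y) = x^2 - y^2$ is my own toy example, not from the post): at the saddle the gradient vanishes, so gradient descent stalls, but the Hessian's negative eigenvalue both flags the saddle and gives a direction to escape along.

```python
import numpy as np

# Classic saddle: f(x, y) = x^2 - y^2 has a critical point at the origin.
def grad(v):
    x, y = v
    return np.array([2 * x, -2 * y])

def hessian(v):
    # Constant Hessian for this particular f.
    return np.array([[2.0, 0.0],
                     [0.0, -2.0]])

v = np.array([1e-8, 1e-8])          # a point very close to the saddle
print(grad(v))                      # ~[2e-8, -2e-8]: gradient is tiny, so GD barely moves

eigvals, eigvecs = np.linalg.eigh(hessian(v))
print(eigvals)                      # [-2.,  2.]: the negative eigenvalue flags a saddle
print(eigvecs[:, 0])                # [0., ±1.]: its eigenvector is a descent direction out
```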

The issue, though, is that the second-order derivative is the Hessian, an $n \times n$ matrix with $n^2$ entries, whereas gradient descent only needs the gradient, a vector with $n$ entries. The memory and computation become intractable for large networks, especially if you have millions of weights.
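A quick back-of-envelope calculation (assuming 4-byte float32 values; the weight counts are just examples) shows how fast the full Hessian becomes infeasible even to store:

```python
# Memory cost of the gradient (n entries) vs. the full Hessian (n^2 entries),
# assuming 4-byte float32 values.
for n in (1_000_000, 100_000_000):                  # 1M and 100M weights
    grad_mib = 4 * n / 2**20
    hess_tib = 4 * n * n / 2**40
    print(f"n = {n:>11,}: gradient {grad_mib:8.1f} MiB, Hessian {hess_tib:10.1f} TiB")
```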
Some approaches exist that efficiently approximate second-order optimization, solving the tractability problem. A popular one is L-BFGS. I haven't played around with it much, but I believe L-BFGS is not as popular as first-order algorithms (such as SGD with momentum or Adam) because it is still quite memory-demanding (it requires storing about 20-100 previous gradient evaluations) and does not work in a stochastic setting (you cannot sample mini-batches to train on; each iteration must evaluate the entire dataset). If those two points are not an issue for you, then L-BFGS works pretty well, I believe.
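For reference, here is a minimal sketch of how L-BFGS is typically driven in PyTorch via `torch.optim.LBFGS` (the tiny linear model and synthetic data are made up for illustration). Note that the closure evaluates the loss on the full dataset every time the optimizer asks for it, which is the full-batch constraint mentioned above, and `history_size` corresponds to the stored gradient history.

```python
import torch

# Synthetic full-batch regression problem (model and data are illustrative only).
torch.manual_seed(0)
X = torch.randn(512, 10)
y = X @ torch.randn(10, 1) + 0.01 * torch.randn(512, 1)

model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()

# history_size = number of past gradient/step pairs kept in memory.
optimizer = torch.optim.LBFGS(model.parameters(), lr=1.0, history_size=20, max_iter=20)

def closure():
    # L-BFGS may re-evaluate the objective several times per step (line search),
    # and it does so on the *full* dataset -- no mini-batch sampling here.
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    return loss

for step in range(10):
    loss = optimizer.step(closure)
    print(step, loss.item())
```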
TensorFlow provides an implementation of the Hessian, tf.hessians. There are probably other implementations of other specific methods around, such as psgd_tf (which I never tried, though). – nbro Dec 12 '20 at 11:41
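For completeness, nested `GradientTape`s are a common way to get a Hessian in eager TensorFlow 2 code (as far as I know, `tf.hessians` itself belongs to the graph-mode `tf.gradients` family); the cubic toy function below is an arbitrary choice of mine.

```python
import tensorflow as tf

# Toy scalar function of a 3-dimensional variable (arbitrary choice for illustration).
x = tf.Variable([1.0, 2.0, 3.0])

with tf.GradientTape() as outer:
    with tf.GradientTape() as inner:
        y = tf.reduce_sum(x ** 3)        # f(x) = sum(x_i^3)
    g = inner.gradient(y, x)             # gradient: 3 * x_i^2
h = outer.jacobian(g, x)                 # Hessian: diag(6 * x_i)

print(g.numpy())   # [ 3. 12. 27.]
print(h.numpy())   # 3x3 matrix with diagonal 6, 12, 18
```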