Second-order optimization algorithms, such as Newton-style methods that use the Hessian, have more information about the curvature of the loss function, so they can converge in far fewer iterations than first-order algorithms like gradient descent. I remember reading somewhere that if you have $n$ weights in the neural network, one iteration of a second-order optimization algorithm reduces the loss at roughly the same rate as $n$ iterations of a standard first-order algorithm. However, with more recent improvements to gradient descent (momentum, adaptive learning rates, etc.), the difference isn't as large anymore -- @EmmanuelMess pointed out a paper that states:
> The performance of the proposed first order and second order methods with adaptive gain (BP-AG, CGFR-AG, BFGS-AG) with standard second order methods without gain (BP, CGFR, BFGS) in terms of speed of convergence evaluated in the number of epochs and CPU time. Based on some simulation results, it’s showed that the proposed algorithm had shown improvements in the convergence rate with 40% faster than other standard algorithms without losing their accuracy.
Here is a great post explaining the math behind why this is the case.
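To make the convergence gap concrete, here is a minimal NumPy sketch (the quadratic objective, its conditioning, and the step counts are arbitrary illustrative choices of mine, not from the post or the paper): on a quadratic loss a single Newton step $w \leftarrow w - H^{-1}\nabla L(w)$ lands on the minimum, while plain gradient descent needs many iterations when the Hessian is ill-conditioned.

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T A w with an ill-conditioned Hessian A.
# (A, the starting point and the learning rate are illustrative choices.)
rng = np.random.default_rng(0)
n = 50
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
A = Q @ np.diag(np.logspace(0, 3, n)) @ Q.T   # eigenvalues spread from 1 to 1000

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

w0 = rng.normal(size=n)

# One Newton step: w <- w - H^{-1} grad. For a quadratic this jumps straight to the minimum.
w_newton = w0 - np.linalg.solve(A, grad(w0))

# Many gradient-descent steps, with the step size limited by the largest eigenvalue.
w_gd = w0.copy()
lr = 1.0 / 1000
for _ in range(1000):
    w_gd = w_gd - lr * grad(w_gd)

print("loss after 1 Newton step:", loss(w_newton))   # essentially 0 (machine precision)
print("loss after 1000 GD steps:", loss(w_gd))        # still orders of magnitude larger
```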
Also, second-order information can help the optimizer identify states like saddle points and get out of them. Saddle points give standard gradient descent a lot of trouble: near a saddle point the gradient is tiny, so gradient descent moves out of it very slowly. Fixing the saddle-point issue has been one of the motivations for improving gradient descent over the last two decades (SGD with momentum, adaptive learning rates, Adam, etc.). More info.
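As a tiny illustration of the saddle-point issue (the function $f(x, y) = x^2 - y^2$ is my own toy example, not from the post): at the saddle the gradient vanishes, so gradient descent stalls, but the Hessian's negative eigenvalue both flags the saddle and gives a direction to escape along.

```python
import numpy as np

# Classic saddle: f(x, y) = x^2 - y^2 has a critical point at the origin.
def grad(v):
    x, y = v
    return np.array([2 * x, -2 * y])

def hessian(v):
    # Constant Hessian for this particular f.
    return np.array([[2.0, 0.0],
                     [0.0, -2.0]])

v = np.array([1e-8, 1e-8])          # a point very close to the saddle
print(grad(v))                      # ~[2e-8, -2e-8]: gradient is tiny, so GD barely moves

eigvals, eigvecs = np.linalg.eigh(hessian(v))
print(eigvals)                      # [-2.,  2.]: the negative eigenvalue flags a saddle
print(eigvecs[:, 0])                # [0., ±1.]: its eigenvector is a descent direction out
```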

The issue, though, is that the second-order derivative is the Hessian, an $n \times n$ matrix with $n^2$ entries, whereas gradient descent only needs the gradient, a vector with $n$ entries. The memory and computation become intractable for large networks, especially if you have millions of weights.
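A quick back-of-envelope calculation (assuming 4-byte float32 values; the weight counts are just examples) shows how fast the full Hessian becomes infeasible even to store:

```python
# Memory cost of the gradient (n entries) vs. the full Hessian (n^2 entries),
# assuming 4-byte float32 values.
for n in (1_000_000, 100_000_000):                  # 1M and 100M weights
    grad_mib = 4 * n / 2**20
    hess_tib = 4 * n * n / 2**40
    print(f"n = {n:>11,}: gradient {grad_mib:8.1f} MiB, Hessian {hess_tib:10.1f} TiB")
```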
Some approaches exist that efficiently approximate second-order optimization, solving the tractability problem. A popular one is L-BFGS. I haven't played around with it much, but I believe L-BFGS is not as popular as first-order algorithms (such as SGD with momentum or Adam) because it is still quite memory-demanding (it requires storing about 20-100 previous gradient evaluations) and does not work in a stochastic setting (you cannot sample mini-batches to train on; each iteration must evaluate the entire dataset). If those two points are not an issue for you, then L-BFGS works pretty well, I believe.
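For reference, here is a minimal sketch of how L-BFGS is typically driven in PyTorch via `torch.optim.LBFGS` (the tiny linear model and synthetic data are made up for illustration). Note that the closure evaluates the loss on the full dataset every time the optimizer asks for it, which is the full-batch constraint mentioned above, and `history_size` corresponds to the stored gradient history.

```python
import torch

# Synthetic full-batch regression problem (model and data are illustrative only).
torch.manual_seed(0)
X = torch.randn(512, 10)
y = X @ torch.randn(10, 1) + 0.01 * torch.randn(512, 1)

model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()

# history_size = number of past gradient/step pairs kept in memory.
optimizer = torch.optim.LBFGS(model.parameters(), lr=1.0, history_size=20, max_iter=20)

def closure():
    # L-BFGS may re-evaluate the objective several times per step (line search),
    # and it does so on the *full* dataset -- no mini-batch sampling here.
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    return loss

for step in range(10):
    loss = optimizer.step(closure)
    print(step, loss.item())
```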
TensorFlow provides an implementation of the Hessian, tf.hessians. There are probably other implementations of other specific methods around, such as psgd_tf (which I never tried, though). – nbro Dec 12 '20 at 11:41
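For completeness, nested `GradientTape`s are a common way to get a Hessian in eager TensorFlow 2 code (as far as I know, `tf.hessians` itself belongs to the graph-mode `tf.gradients` family); the cubic toy function below is an arbitrary choice of mine.

```python
import tensorflow as tf

# Toy scalar function of a 3-dimensional variable (arbitrary choice for illustration).
x = tf.Variable([1.0, 2.0, 3.0])

with tf.GradientTape() as outer:
    with tf.GradientTape() as inner:
        y = tf.reduce_sum(x ** 3)        # f(x) = sum(x_i^3)
    g = inner.gradient(y, x)             # gradient: 3 * x_i^2
h = outer.jacobian(g, x)                 # Hessian: diag(6 * x_i)

print(g.numpy())   # [ 3. 12. 27.]
print(h.numpy())   # 3x3 matrix with diagonal 6, 12, 18
```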