> I have absolutely no idea how to obtain these insights which I cited.
Here's what I think he means: we have (by assumption) some quadratic error surface
$$
E(w) = \widehat{E} + (w-\widehat{w})^Tb+\frac{1}{2}(w-\widehat{w})^TH (w-\widehat{w})
$$
where we do not know $b$ or $H$. Altogether, these two parameters have $O(W^2)$ independent values, so we would expect to need at least $O(W^2)$ independent pieces of information to determine $b$ and $H$ exactly. However, each evaluation of $E$ (one forward pass) yields only a single number, at a cost of $O(W)$, so gathering $O(W^2)$ pieces of information this way costs $T_C=O(W^3)$ in total. By contrast, each evaluation of $\nabla E$ yields $O(W)$ pieces of information (it is a vector with $W$ components), also at a cost of $O(W)$. Hence gathering $O(W^2)$ pieces of information takes only $O(W)$ evaluations of $\nabla E$, each costing $O(W)$, for a total of only $T_C=O(W^2)$.
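To make the counting concrete (assuming, as for a Hessian, that $H$ is symmetric): $b$ contributes $W$ independent values and $H$ contributes $W(W+1)/2$, so in total
$$
W + \frac{W(W+1)}{2} = \frac{W(W+3)}{2} \in O(W^2).
$$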
I think the above is sufficient, but perhaps we can give this some extra thought from an optimization perspective. Notice that we can rewrite $E$ by expanding it:
$$ E(w) = c + w^Tb + \frac{1}{2}w^T Hw $$
This is an unconstrained quadratic form (see e.g. here or here; or, more generally, look at quadratic programming). Let's assume that $H$ is positive definite; this holds if the error surface is "nice" and strictly convex, or if we take the expansion point $\widehat{w}$ to be the minimum itself, in which case the Hessian must be positive semi-definite (and positive definite if the minimum is non-degenerate). Then
$$ \nabla E = Hw + b $$
so if we knew $H$ and $b$, we could find the minimum by solving the linear system $Hw=-b$. This means finding the $O(W^2)$ numbers inside $H$ and $b$. Can we do this via measurements of $E$ (i.e. running forward passes with different $w$)? Sure. This corresponds to gathering $\{ (w_1,E_1),\ldots, (w_n,E_n)\}$ and then solving
$$ H^*,b^*,c^* = \arg\min_{H,b,c} \sum_i \left(E_i - c - w_i^Tb - \tfrac{1}{2}w_i^T H w_i\right)^2 $$
which looks like a non-linear least squares problem, but is actually a polynomial (quadratic) least-squares regression problem: the model is quadratic in $w$ but linear in the unknowns, since $b$ and $H$ are just the coefficients of a second-order polynomial once you write out the matrix multiplications. See here for an example. We can therefore solve it by ordinary least squares, minimizing $\|\widetilde{E} - X\beta\|^2$, where $\widetilde{E}$ is the vector of measured errors, $\beta$ holds the unfolded parameters ($b$, $H$, $c$), and $X$ is a design matrix computed from the measured $w$'s. See this question. Ultimately, the solution comes from solving the linear system $\widetilde{E} = X\beta$ or
the normal equations $X^TX\beta = X^T \widetilde{E}$. You cannot expect a reasonable solution without at least $|\beta|$ rows in $X$; in fact, with noiseless samples you need exactly that many to pin down the solution (otherwise the linear system is under-determined). But $|\beta|\in O(W^2)$. Hence you need at least $O(W^2)$ samples to estimate $H$ and $b$, which would then let you solve for the minimizer easily.
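Here is a minimal numpy sketch of this regression viewpoint, under the assumptions above (noiseless measurements of $E$, symmetric positive-definite $H$) and with a tiny $W$ so it runs instantly; the helper names (`E`, `features`, `H_true`, and so on) are just illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W = 5  # number of "weights" (kept tiny so the example runs instantly)

# Ground-truth quadratic surface E(w) = c + w^T b + 0.5 * w^T H w, with H SPD.
A = rng.standard_normal((W, W))
H_true = A @ A.T + W * np.eye(W)           # symmetric positive definite
b_true = rng.standard_normal(W)
c_true = rng.standard_normal()

def E(w):
    """One 'forward pass': a single scalar measurement of the error."""
    return c_true + w @ b_true + 0.5 * w @ H_true @ w

def features(w):
    """Design-matrix row: [1, w_1..w_W, w_j * w_k for j <= k]."""
    quad = [w[j] * w[k] for j in range(W) for k in range(j, W)]
    return np.concatenate(([1.0], w, quad))

n_params = 1 + W + W * (W + 1) // 2        # |beta| is O(W^2)
n_samples = n_params                        # noiseless case: exactly |beta| samples suffice

ws = rng.standard_normal((n_samples, W))
X = np.stack([features(w) for w in ws])     # design matrix
E_tilde = np.array([E(w) for w in ws])      # measured errors

beta, *_ = np.linalg.lstsq(X, E_tilde, rcond=None)

# Unfold beta back into (c, b, H).
c_hat = beta[0]
b_hat = beta[1:1 + W]
H_hat = np.zeros((W, W))
idx = 1 + W
for j in range(W):
    for k in range(j, W):
        coef = beta[idx]; idx += 1
        if j == k:
            H_hat[j, j] = 2.0 * coef        # 0.5 * H_jj multiplies w_j^2
        else:
            H_hat[j, k] = H_hat[k, j] = coef

w_star = np.linalg.solve(H_hat, -b_hat)     # stationary point of the fitted quadratic
print(np.allclose(H_hat, H_true), np.allclose(w_star, np.linalg.solve(H_true, -b_true)))
```

With exactly $|\beta| = 1 + W + W(W+1)/2$ noiseless samples the design matrix is (generically) square and invertible, so the fit recovers $c$, $b$, $H$ exactly and the recovered minimizer matches the true one; with fewer samples the system is under-determined, as argued above.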
In other words, if we want to get to the minimum of $E$, there isn't an obvious way to do so without having at least $O(W^2)$ pieces of information.
Incidentally, I'm not a fan of explaining it this way. I think it's more insightful to note that if one objective evaluation $L(x)$ (a forward pass) costs $cW$ in runtime, then evaluating $\nabla L(x)$ via back-prop costs only $rW$, where $W \gg r > c$ and $r$ depends on your model (the backward pass). But $L(x)$ is a single number, while $\nabla L$ is $W$ numbers. So we get far more information per evaluation of $\nabla L$ than of $L$ ($W$ times more), yet computing $\nabla L$ costs much less than $W$ times the cost of $L$. Notice also that a naive search (as in finite-difference derivative approximations) has to compute $L$ at least $O(W)$ times to figure out how to perturb each weight, costing $O(W^2)$ for a single step, whereas back-prop lets us take one step for only $O(W)$ computation. Note that this requires being able to compute $\nabla L$ efficiently via automatic differentiation, which usually makes the backward pass cost only a small constant times the forward pass.
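To illustrate that counting, here is a small PyTorch sketch (the toy objective, the value of `W`, and the tolerance are arbitrary choices for the demonstration): one backward pass returns all $W$ partial derivatives, while the finite-difference route needs $W+1$ forward passes to produce the same $W$ numbers.

```python
import torch

torch.manual_seed(0)
W = 200
x = torch.randn(W, dtype=torch.float64)

def L(w):
    """Toy scalar objective; stands in for one O(W) forward pass."""
    return torch.sum(torch.tanh(w * x)) ** 2

w = torch.randn(W, dtype=torch.float64, requires_grad=True)

# Back-prop: one forward + one backward pass yields all W partial derivatives.
L(w).backward()
grad_bp = w.grad.clone()

# Finite differences: W + 1 forward passes (one baseline + one per weight).
eps = 1e-6
with torch.no_grad():
    base = L(w)
    grad_fd = torch.zeros(W, dtype=torch.float64)
    for i in range(W):
        w_pert = w.clone()
        w_pert[i] += eps
        grad_fd[i] = (L(w_pert) - base) / eps

print(torch.allclose(grad_bp, grad_fd, atol=1e-3))  # same W numbers, very different cost
```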
Ok, finally you asked
> I would also appreciate any literature which explains the number of steps needed for zero, first, second order optimization.
The optimization order refers to the highest order of derivative the method uses.
In general, the higher the order, the more powerful the method. For instance, a second-order method uses the Hessian (or, e.g., the natural gradient), giving $O(W^2)$ pieces of information per step. Why don't we use these for deep learning? The main reason is that computing and storing the full Hessian is too expensive: it has $O(W^2)$ entries, which does not come close to fitting in memory for modern networks.
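As a rough back-of-the-envelope illustration (the parameter count and precision below are just assumed for the arithmetic):

```python
W = 100_000_000                    # assume a 100M-parameter network, float32 (4 bytes/entry)
gradient_bytes = 4 * W             # O(W):   roughly 0.4 GB
hessian_bytes = 4 * W * W          # O(W^2): roughly 40 PB for a dense Hessian
print(f"gradient: {gradient_bytes / 1e9:.1f} GB, dense Hessian: {hessian_bytes / 1e15:.0f} PB")
```

At that scale one could not even store the dense Hessian, let alone invert it.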
Another reason is the stochastic nature of the error signal (we cannot compute $E$ itself, only a random estimate of it), which makes the derivative estimates noisy; the Hessian estimate is even noisier than the gradient estimate, however (derivative operators amplify noise in signals).
Regarding the number of steps needed, don't be misled by this toy example: for the full error surface defined by a non-linear, non-convex function (like a neural network), there is in general no way to measure or estimate the number of steps needed.
Otherwise we would know in advance how long to train our networks for. :)
In some simple cases one can establish convergence theorems, but that is a matter for optimization theory.