> I have absolutely no idea how to obtain these insights which I cited.
Here's what I think he means: we have (by assumption) some quadratic error surface
$$
E(w) = \widehat{E} + (w-\widehat{w})^Tb+\frac{1}{2}(w-\widehat{w})^TH (w-\widehat{w})
$$
where we do not know $b$ or $H$. Altogether, these two parameters have $O(W^2)$ independent values, so we would expect to need at least $O(W^2)$ independent pieces of information to determine $b$ and $H$ exactly. However, each evaluation of $E$ (one forward pass) yields only a single number, at a cost of $O(W)$, so gathering $O(W^2)$ pieces of information this way costs $T_C=O(W^3)$ in total. By contrast, each evaluation of $\nabla E$ yields $O(W)$ pieces of information (it is a vector with $W$ components), also at a cost of $O(W)$. Hence gathering $O(W^2)$ pieces of information takes only $O(W)$ evaluations of $\nabla E$, each costing $O(W)$, for a total of only $T_C=O(W^2)$.
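To make the counting concrete (assuming, as for a Hessian, that $H$ is symmetric): $b$ contributes $W$ independent values and $H$ contributes $W(W+1)/2$, so in total
$$
W + \frac{W(W+1)}{2} = \frac{W(W+3)}{2} \in O(W^2).
$$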
I think the above is sufficient, but perhaps we can give this some extra thought from an optimization perspective. Notice that we can rewrite $E$ by expanding it:
$$ E(w) = c + w^Tb + \frac{1}{2}w^T Hw $$
This is an unconstrained quadratic form (see e.g. here or here; or, more generally, look at quadratic programming). Let's assume that $H$ is positive definite; this holds if the error surface is "nice" and strictly convex, or if we take the expansion point $\widehat{w}$ to be the minimum itself, in which case the Hessian must be positive semi-definite (and positive definite if the minimum is non-degenerate). Then
$$ \nabla E = Hw + b $$
so if we knew $H$ and $b$, we could find the minimum by solving the linear system $Hw=-b$. This means finding the $O(W^2)$ numbers inside $H$ and $b$. Can we do this via measurements of $E$ (i.e. running forward passes with different $w$)? Sure. This corresponds to gathering $\{ (w_1,E_1),\ldots, (w_n,E_n)\}$ and then solving
$$ H^*,b^*,c^* = \arg\min_{H,b,c} \sum_i \left(E_i - c - w_i^Tb - \tfrac{1}{2}w_i^T H w_i\right)^2 $$
which looks like a non-linear least squares problem, but is actually a polynomial (quadratic) least-squares regression problem: the model is quadratic in $w$ but linear in the unknowns, since $b$ and $H$ are just the coefficients of a second-order polynomial once you write out the matrix multiplications. See here for an example. We can therefore solve it by ordinary least squares, minimizing $\|\widetilde{E} - X\beta\|^2$, where $\widetilde{E}$ is the vector of measured errors, $\beta$ holds the unfolded parameters ($b$, $H$, $c$), and $X$ is a design matrix computed from the measured $w$'s. See this question. Ultimately, the solution comes from solving the linear system $\widetilde{E} = X\beta$ or
the normal equations $X^TX\beta = X^T \widetilde{E}$. You cannot expect a reasonable solution without at least $|\beta|$ rows in $X$; in fact, with noiseless samples you need exactly that many to pin down the solution (otherwise the linear system is under-determined). But $|\beta|\in O(W^2)$. Hence you need at least $O(W^2)$ samples to estimate $H$ and $b$, which would then let you solve for the minimizer easily.
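Here is a minimal numpy sketch of this regression viewpoint, under the assumptions above (noiseless measurements of $E$, symmetric positive-definite $H$) and with a tiny $W$ so it runs instantly; the helper names (`E`, `features`, `H_true`, and so on) are just illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W = 5  # number of "weights" (kept tiny so the example runs instantly)

# Ground-truth quadratic surface E(w) = c + w^T b + 0.5 * w^T H w, with H SPD.
A = rng.standard_normal((W, W))
H_true = A @ A.T + W * np.eye(W)           # symmetric positive definite
b_true = rng.standard_normal(W)
c_true = rng.standard_normal()

def E(w):
    """One 'forward pass': a single scalar measurement of the error."""
    return c_true + w @ b_true + 0.5 * w @ H_true @ w

def features(w):
    """Design-matrix row: [1, w_1..w_W, w_j * w_k for j <= k]."""
    quad = [w[j] * w[k] for j in range(W) for k in range(j, W)]
    return np.concatenate(([1.0], w, quad))

n_params = 1 + W + W * (W + 1) // 2        # |beta| is O(W^2)
n_samples = n_params                        # noiseless case: exactly |beta| samples suffice

ws = rng.standard_normal((n_samples, W))
X = np.stack([features(w) for w in ws])     # design matrix
E_tilde = np.array([E(w) for w in ws])      # measured errors

beta, *_ = np.linalg.lstsq(X, E_tilde, rcond=None)

# Unfold beta back into (c, b, H).
c_hat = beta[0]
b_hat = beta[1:1 + W]
H_hat = np.zeros((W, W))
idx = 1 + W
for j in range(W):
    for k in range(j, W):
        coef = beta[idx]; idx += 1
        if j == k:
            H_hat[j, j] = 2.0 * coef        # 0.5 * H_jj multiplies w_j^2
        else:
            H_hat[j, k] = H_hat[k, j] = coef

w_star = np.linalg.solve(H_hat, -b_hat)     # stationary point of the fitted quadratic
print(np.allclose(H_hat, H_true), np.allclose(w_star, np.linalg.solve(H_true, -b_true)))
```

With exactly $|\beta| = 1 + W + W(W+1)/2$ noiseless samples the design matrix is (generically) square and invertible, so the fit recovers $c$, $b$, $H$ exactly and the recovered minimizer matches the true one; with fewer samples the system is under-determined, as argued above.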
In other words, if we want to get to the minimum of $E$, there isn't an obvious way to do so without having at least $O(W^2)$ pieces of information.
Incidentally, I'm not a fan of explaining it this way. I think it's more insightful to note that if one objective evaluation $L(x)$ (a forward pass) costs $cW$ in runtime, then evaluating $\nabla L(x)$ via back-prop costs only $rW$, where $W \gg r > c$ and $r$ depends on your model (the backward pass). But $L(x)$ is a single number, while $\nabla L$ is $W$ numbers. So we get far more information per evaluation of $\nabla L$ than of $L$ ($W$ times more), yet computing $\nabla L$ costs much less than $W$ times the cost of $L$. Notice also that a naive search (as in finite-difference derivative approximations) has to compute $L$ at least $O(W)$ times to figure out how to perturb each weight, costing $O(W^2)$ for a single step, whereas back-prop lets us take one step for only $O(W)$ computation. Note that this requires being able to compute $\nabla L$ efficiently via automatic differentiation, which usually makes the backward pass cost only a small constant times the forward pass.
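To illustrate that counting, here is a small PyTorch sketch (the toy objective, the value of `W`, and the tolerance are arbitrary choices for the demonstration): one backward pass returns all $W$ partial derivatives, while the finite-difference route needs $W+1$ forward passes to produce the same $W$ numbers.

```python
import torch

torch.manual_seed(0)
W = 200
x = torch.randn(W, dtype=torch.float64)

def L(w):
    """Toy scalar objective; stands in for one O(W) forward pass."""
    return torch.sum(torch.tanh(w * x)) ** 2

w = torch.randn(W, dtype=torch.float64, requires_grad=True)

# Back-prop: one forward + one backward pass yields all W partial derivatives.
L(w).backward()
grad_bp = w.grad.clone()

# Finite differences: W + 1 forward passes (one baseline + one per weight).
eps = 1e-6
with torch.no_grad():
    base = L(w)
    grad_fd = torch.zeros(W, dtype=torch.float64)
    for i in range(W):
        w_pert = w.clone()
        w_pert[i] += eps
        grad_fd[i] = (L(w_pert) - base) / eps

print(torch.allclose(grad_bp, grad_fd, atol=1e-3))  # same W numbers, very different cost
```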
Ok, finally you asked
> I would also appreciate any literature which explains the number of steps needed for zero, first, second order optimization.
The optimization order refers to the highest order of derivative the method uses.
In general, the higher the order, the more powerful the method. For instance, a second-order method uses the Hessian (or, e.g., the natural gradient), giving $O(W^2)$ pieces of information per step. Why don't we use these for deep learning? The main reason is that computing and storing the full Hessian is too expensive: it has $O(W^2)$ entries, which does not come close to fitting in memory for modern networks.
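As a rough back-of-the-envelope illustration (the parameter count and precision below are just assumed for the arithmetic):

```python
W = 100_000_000                    # assume a 100M-parameter network, float32 (4 bytes/entry)
gradient_bytes = 4 * W             # O(W):   roughly 0.4 GB
hessian_bytes = 4 * W * W          # O(W^2): roughly 40 PB for a dense Hessian
print(f"gradient: {gradient_bytes / 1e9:.1f} GB, dense Hessian: {hessian_bytes / 1e15:.0f} PB")
```

At that scale one could not even store the dense Hessian, let alone invert it.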
Another reason is the stochastic nature of the error signal (we cannot compute $E$ itself, only a random estimate of it), which makes the derivative estimates noisy; the Hessian estimate is even noisier than the gradient estimate, however (derivative operators amplify noise in signals).
Regarding the number of steps needed, don't be misled by this toy example: for the full error surface defined by a non-linear, non-convex function (like a neural network), there is in general no way to measure or estimate the number of steps needed.
Otherwise we would know in advance how long to train our networks for. :)
In some simple cases one can establish convergence theorems, but that is a matter for optimization theory.